I don't have the ACE2005 dataset, so I don't know the exact format of your data input, S

question about data format about hmeae HOT 6 CLOSED

thunlp commented on July 29, 2024

question about data format

from hmeae.

Comments (6)

wzq016 commented on July 29, 2024

Thank you for your interest in our paper!

loader.maxlen: the maximum sentence length over all instances, an integer. For example, ['The', 'tank', 'Attacked', 'that', 'hotel'] has length 5.
loader.max_argument_len: the maximum argument length over all instances, an integer. For example, the entity ['The', 'tank'] has length 2.
loader.word_emb: the embedding matrix, with shape [num_of_words_in_pretrained_model, embedding_dims].
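
To make the three loader attributes concrete, here is a minimal sketch that derives them from toy data. The variable names, the toy sentences, and the zero-filled embedding matrix are assumptions for illustration only; in the real loader, word_emb would come from a pretrained model rather than being filled with zeros.

```python
# Toy data (hypothetical); real instances would come from ACE2005.
sentences = [
    ['The', 'tank', 'Attacked', 'that', 'hotel'],
    ['Yesterday', 'the', 'tank', 'Attacked', 'that', 'hotel'],
]
entities = [['The', 'tank'], ['that', 'hotel'], ['hotel']]

# Longest sentence and longest entity, counted in tokens.
maxlen = max(len(s) for s in sentences)
max_argument_len = max(len(e) for e in entities)

# Placeholder embedding matrix: one row per vocabulary word.
vocab = sorted({w for s in sentences for w in s})
embedding_dims = 3
word_emb = [[0.0] * embedding_dims for _ in vocab]
word_emb_shape = (len(word_emb), len(word_emb[0]))

print(maxlen, max_argument_len, word_emb_shape)
```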

t_data is a list of 3 elements, corresponding to the training, dev, and test sets. Each element is a tuple of 6 arrays. In the following explanation, number_of_trigger_instance is the number of instances used in trigger classification. Note that this differs from the number of sentences, because every token can serve as a training instance. number_of_argument_instance is defined similarly: every (token, entity) combination is an instance.

The 1st array, with shape [number_of_trigger_instance, loader.maxlen], gives each word's position relative to the trigger word.
The 2nd array gives the sentence as word indices, with shape [number_of_trigger_instance, loader.maxlen].
The 3rd array gives a mask relative to the trigger: words to the left of the trigger are 1, all others are 0. Its shape is [number_of_trigger_instance, loader.maxlen].
The 4th array is similar to the 3rd, but words to the right of the trigger are 1 and all others are 0.
The 5th array gives the event type of each instance, with shape [number_of_trigger_instance].
The 6th array has shape [number_of_trigger_instance, 3]. Each row gives the word indices of the word right before the trigger, the trigger word itself, and the word right after the trigger.
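
As a rough illustration of the per-instance arrays described above, the sketch below builds the rows for a single trigger candidate. The function name build_trigger_instance, the toy word indices, and the choice of 0 as the padding index (including for a missing neighbour at a sentence boundary) are assumptions, not taken from the hmeae code. The position formula is the one from the get_positions function quoted later in this thread.

```python
PAD = 0  # assumed padding index


def get_positions(start_idx, sent_len, maxlen):
    # Position formula from the get_positions function quoted in this thread.
    return (list(range(maxlen - start_idx, maxlen)) + [maxlen]
            + list(range(maxlen + 1, maxlen + sent_len - start_idx))
            + [0] * (maxlen - sent_len))


def build_trigger_instance(word_ids, trigger_idx, maxlen):
    """Build one row of arrays 1-4 and 6 for a single trigger candidate."""
    n = len(word_ids)
    padding = [0] * (maxlen - n)
    positions = get_positions(trigger_idx, n, maxlen)                   # 1st array
    sentence = word_ids + [PAD] * (maxlen - n)                          # 2nd array
    left_mask = [1 if i < trigger_idx else 0 for i in range(n)] + padding   # 3rd
    right_mask = [1 if i > trigger_idx else 0 for i in range(n)] + padding  # 4th
    before = word_ids[trigger_idx - 1] if trigger_idx > 0 else PAD
    after = word_ids[trigger_idx + 1] if trigger_idx + 1 < n else PAD
    window = [before, word_ids[trigger_idx], after]                     # 6th array
    return positions, sentence, left_mask, right_mask, window


# ['The', 'tank', 'Attacked', 'that', 'hotel'] with toy word indices 1..5,
# candidate trigger 'Attacked' at token index 2, padded to maxlen = 7.
positions, sentence, left_mask, right_mask, window = build_trigger_instance(
    [1, 2, 3, 4, 5], trigger_idx=2, maxlen=7)
print(positions, sentence, left_mask, right_mask, window)
```

The 5th array (event type) is omitted here because it is simply one label per instance.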

a_data is a list of 3 elements, corresponding to the training, dev, and test sets. Each element is a tuple of 12 arrays. In a_data, the first dimension is number_of_argument_instance instead of number_of_trigger_instance.

The 1st array has the same meaning as the 2nd array of t_data.
The 2nd array has the same meaning as the 5th array of t_data.
The 3rd array gives the role of the specific entity, with shape [number_of_argument_instance].
The 4th and 6th arrays are similar to the 3rd and 4th arrays of t_data, but the masks are relative to the argument rather than the trigger.
The 5th array is the argument mask: argument words are 1, all others are 0. For example, for the sentence ['Yesterday', 'the', 'tank', 'Attacked', 'that', 'hotel'] and the entity "the tank", the 4th array is [1,0,0,0,0,0], the 5th array is [0,1,1,0,0,0], and the 6th array is [0,0,0,1,1,1].
The 7th array is the same as the 6th array of t_data.
The 8th array is similar to the 7th, but uses the argument words instead of the trigger word, with shape [number_of_argument_instance, loader.max_argument_len+2].
The 9th, 10th, and 11th arrays are the same as the 3rd, 4th, and 1st arrays of t_data.
The 12th array is similar to the 11th, but uses the argument rather than the trigger as the position reference.
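
The three argument-relative masks (left of the argument, the argument itself, right of the argument) can be sketched as follows. The function name argument_masks and the half-open span convention (start inclusive, end exclusive) are assumptions for illustration. Padding slots beyond the sentence are set to 0 in all three masks, which is also an assumption.

```python
def argument_masks(sent_len, arg_start, arg_end, maxlen):
    """Return (left, inside, right) masks for an argument span.

    arg_start is inclusive and arg_end is exclusive (assumed convention).
    """
    left = [1 if i < arg_start else 0 for i in range(maxlen)]
    inside = [1 if arg_start <= i < arg_end else 0 for i in range(maxlen)]
    right = [1 if arg_end <= i < sent_len else 0 for i in range(maxlen)]
    return left, inside, right


tokens = ['Yesterday', 'the', 'tank', 'Attacked', 'that', 'hotel']

# Entity "the tank" covers token indices 1 and 2.
left, inside, right = argument_masks(len(tokens), 1, 3, maxlen=len(tokens))
print(left, inside, right)

# The same sentence padded to a longer maxlen: padding stays 0 everywhere.
left_p, inside_p, right_p = argument_masks(len(tokens), 1, 3, maxlen=8)
print(left_p, inside_p, right_p)
```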

Hope this can help you.

SaintLogos1234 commented on July 29, 2024

Thank you very much for your reply. I have a question to ask:

posi_mat = tf.concat(
    [tf.zeros([1, constant.posi_embedding_dim], tf.float32),
     tf.get_variable('posi_emb', [2*maxlen, constant.posi_embedding_dim], tf.float32,
                     initializer=tf.contrib.layers.xavier_initializer())],
    axis=0)

Why is the position embedding initialized this way?

SaintLogos1234 commented on July 29, 2024

I have another question: if a sentence contains two triggers, is the number of trigger instances two?

wzq016 commented on July 29, 2024

Yes. In the trigger classification stage, each token is a trigger candidate, because every word has some probability of being a trigger. That is, one sentence yields len(sentence) trigger instances (counted in tokens).
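
To illustrate the counting, here is a minimal sketch that enumerates one trigger instance per token. The function name enumerate_trigger_candidates, the toy event labels, and the "None" placeholder label for non-trigger tokens are assumptions; the thread does not say how non-trigger tokens are labelled.

```python
def enumerate_trigger_candidates(tokens, gold_triggers, none_label="None"):
    """Return one (index, token, label) instance per token.

    gold_triggers maps a token index to its event type. Tokens that are not
    annotated as triggers get none_label (an assumed placeholder).
    """
    return [(i, tok, gold_triggers.get(i, none_label))
            for i, tok in enumerate(tokens)]


tokens = ['The', 'tank', 'attacked', 'and', 'destroyed', 'that', 'hotel']
gold_triggers = {2: 'Attack', 4: 'Attack'}  # toy annotation with two triggers

instances = enumerate_trigger_candidates(tokens, gold_triggers)
num_instances = len(instances)
num_true_triggers = sum(1 for _, _, label in instances if label != "None")
print(num_instances, num_true_triggers)
```

Under this reading, a sentence with two annotated triggers still produces one instance per token, and two of those instances carry a real event type.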

wzq016 commented on July 29, 2024

tf.zeros is for padding words, since we don't care about the position embeddings of padding words.
The other matrix has shape [2*maxlen, constant.posi_embedding_dim]. In this matrix, if a word is at distance dist from the trigger word, its position embedding index is maxlen+dist.

You could refer to this function:

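def get_positions(self, start_idx, sent_len, maxlen):
    # Words left of start_idx get maxlen-start_idx .. maxlen-1, the word at start_idx
    # gets maxlen, words to its right get maxlen+1 onward, and padding slots get 0.
    return (list(range(maxlen - start_idx, maxlen)) + [maxlen]
            + list(range(maxlen + 1, maxlen + sent_len - start_idx))
            + [0] * (maxlen - sent_len))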
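
To show how the zero row and the position indices fit together, here is a minimal pure-Python sketch of the lookup. It stands in for the TensorFlow code above and is not the repository's implementation: build_position_table is a hypothetical helper, and the random uniform initialisation is an assumption that replaces the Xavier initializer.

```python
import random


def get_positions(start_idx, sent_len, maxlen):
    # Same formula as the function quoted above, without self.
    return (list(range(maxlen - start_idx, maxlen)) + [maxlen]
            + list(range(maxlen + 1, maxlen + sent_len - start_idx))
            + [0] * (maxlen - sent_len))


def build_position_table(maxlen, dim, seed=0):
    """Row 0 is a fixed zero vector for padding; rows 1..2*maxlen are 'learned'."""
    rng = random.Random(seed)
    zero_row = [[0.0] * dim]
    learned = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
               for _ in range(2 * maxlen)]
    return zero_row + learned


maxlen, dim = 7, 4
table = build_position_table(maxlen, dim)

# Sentence of 5 tokens, trigger at token index 2, padded to maxlen.
positions = get_positions(start_idx=2, sent_len=5, maxlen=maxlen)
embedded = [table[p] for p in positions]

print(positions)     # the trigger sits at index maxlen, padding at index 0
print(embedded[-1])  # padding slots look up the zero row
```

Because index 0 is reserved for padding, every real token maps to an index between 1 and 2*maxlen-1, so padded slots never share a row with a real relative position. In the TensorFlow version, the zero row comes from tf.zeros rather than a variable, so it is not updated during training.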

wzq016 commented on July 29, 2024

Hi, if you don't have other questions, I will close this issue :).
