
Question about data format (hmeae, CLOSED)

thunlp commented on July 29, 2024
question about data format

from hmeae.

Comments (6)

wzq016 commented on July 29, 2024

Thank you for your interest in our paper!

loader.maxlen: the maximum sentence length (in tokens) over all instances, an integer. E.g. the length of ['The', 'tank', 'Attacked', 'that', 'hotel'] is 5.
loader.max_argument_len: the maximum argument length (in tokens) over all instances, an integer. E.g. the entity ['The', 'tank'] has a length of 2.
loader.word_emb: the embedding matrix, with shape [num_of_words_in_pretrained_model, embedding_dims].
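As a rough illustration of the two length values, here is a minimal sketch that assumes they are plain maxima over the dataset. The toy sentences and entities are made up, and the real loader may compute them differently.

```python
# Toy data for illustration only; not taken from the hmeae dataset.
sentences = [
    ["The", "tank", "Attacked", "that", "hotel"],
    ["Yesterday", "the", "tank", "Attacked", "that", "hotel"],
]
entities = [
    ["The", "tank"],
    ["that", "hotel"],
    ["Yesterday"],
]

# Assumption: both values are simple maxima counted in tokens.
maxlen = max(len(sentence) for sentence in sentences)
max_argument_len = max(len(entity) for entity in entities)

print(maxlen, max_argument_len)
```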

t_data is a list with 3 elements: the training set, dev set and test set. Each element is a tuple of 6 arrays. In the following explanation, number_of_trigger_instance is the number of instances used in trigger classification. Note that this differs from the number of sentences, because every token is a training instance. number_of_argument_instance is defined similarly: every (token, entity) combination is an instance.

1st array, with shape [number_of_trigger_instance, loader.maxlen], gives each word's position relative to the trigger word.
2nd array gives the sentence as word indices, with shape [number_of_trigger_instance, loader.maxlen].
3rd array gives the mask relative to the trigger: words to the left of the trigger are 1, all others are 0. Shape [number_of_trigger_instance, loader.maxlen].
4th array is similar to the 3rd, but words to the right of the trigger are 1 and all others are 0.
5th array gives the event type of each instance, with shape [number_of_trigger_instance].
6th array has shape [number_of_trigger_instance, 3]; each row gives the word indices of the word right before the trigger, the trigger word itself, and the word right after the trigger.
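To make the six arrays concrete, here is a minimal sketch that builds one trigger instance (one row of each array, minus the event type label) for a single candidate token. The function name, the word ids, the padding id 0, and the handling of sentence boundaries are all assumptions for illustration, not the repository's actual loader. The masks follow the description literally, so the trigger token itself is 0 in both.

```python
def build_trigger_instance(word_ids, trigger_idx, maxlen, pad_id=0):
    """Build one row of arrays 1-4 and 6 for a single trigger candidate (hypothetical helper)."""
    sent_len = len(word_ids)
    # 2nd array: word indices padded to maxlen.
    padded = word_ids + [pad_id] * (maxlen - sent_len)
    # 1st array: index maxlen + distance to trigger, 0 for padding.
    positions = [maxlen + (i - trigger_idx) if i < sent_len else 0
                 for i in range(maxlen)]
    # 3rd / 4th arrays: words strictly left / right of the trigger.
    left_mask = [1 if i < trigger_idx else 0 for i in range(maxlen)]
    right_mask = [1 if trigger_idx < i < sent_len else 0 for i in range(maxlen)]
    # 6th array: previous word, trigger word, next word (pad_id at the edges is an assumption).
    prev_id = word_ids[trigger_idx - 1] if trigger_idx > 0 else pad_id
    next_id = word_ids[trigger_idx + 1] if trigger_idx + 1 < sent_len else pad_id
    return {
        "positions": positions,
        "word_ids": padded,
        "left_mask": left_mask,
        "right_mask": right_mask,
        "lexical": [prev_id, word_ids[trigger_idx], next_id],
    }

# ['The', 'tank', 'Attacked', 'that', 'hotel'] with made-up word ids,
# candidate trigger 'Attacked' at index 2, and maxlen 6.
instance = build_trigger_instance([11, 12, 13, 14, 15], trigger_idx=2, maxlen=6)
for name, row in instance.items():
    print(name, row)
```

Calling this once per token of a sentence gives the per-token instances described above.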

a_data is a list of 3 elements: the training set, dev set and test set. Each element is a tuple of 12 arrays. In a_data, the first dimension is number_of_argument_instance instead of number_of_trigger_instance.

1st array has the same meaning as the 2nd array of t_data.
2nd array has the same meaning as the 5th array of t_data.
3rd array gives the role of the specific entity, with shape [number_of_argument_instance].
4th and 6th arrays are similar to the 3rd and 4th arrays in t_data, but the masks are relative to the argument rather than the trigger.
5th array is the argument mask: argument words are 1, all others are 0. E.g. for the sentence ['Yesterday', 'the', 'tank', 'Attacked', 'that', 'hotel'] and the entity "the tank", the 4th array is [1,0,0,0,0,0], the 5th array is [0,1,1,0,0,0] and the 6th array is [0,0,0,1,1,1].
7th array is the same as the 6th array in t_data.
8th array is similar to the 7th array, but uses the argument words instead of the trigger word, with shape [number_of_argument_instance, loader.max_argument_len+2].
9th, 10th and 11th arrays are the same as the 3rd, 4th and 1st arrays in t_data.
12th array is similar to the 11th array, but uses the argument rather than the trigger as the position reference.
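The three argument-related masks (4th, 5th and 6th arrays) can be sketched as below for the "the tank" entity. The function name and the inclusive span convention are assumptions for illustration; the actual loader may represent entity spans differently.

```python
def argument_masks(sent_len, arg_start, arg_end, maxlen):
    """Return (left, argument, right) masks for an entity span.

    arg_start and arg_end are inclusive token indices (an assumed convention).
    """
    left = [1 if i < arg_start else 0 for i in range(maxlen)]
    inside = [1 if arg_start <= i <= arg_end else 0 for i in range(maxlen)]
    right = [1 if arg_end < i < sent_len else 0 for i in range(maxlen)]
    return left, inside, right

tokens = ['Yesterday', 'the', 'tank', 'Attacked', 'that', 'hotel']
# Entity "the tank" covers token indices 1..2.
left, inside, right = argument_masks(len(tokens), 1, 2, maxlen=len(tokens))
print(left)    # 4th array
print(inside)  # 5th array
print(right)   # 6th array
```

With a larger maxlen, the extra padding positions stay 0 in all three masks.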

Hope this can help you.

SaintLogos1234 commented on July 29, 2024

Thank you very much for your reply. I have a question:

posi_mat = tf.concat(
    [tf.zeros([1, constant.posi_embedding_dim], tf.float32),
     tf.get_variable('posi_emb', [2*maxlen, constant.posi_embedding_dim], tf.float32,
                     initializer=tf.contrib.layers.xavier_initializer())],
    axis=0)

Why is the position embedding initialized this way?

SaintLogos1234 commented on July 29, 2024

I have another question: if a sentence contains two triggers, is the number of trigger instances two?

wzq016 commented on July 29, 2024

Yes. In the trigger classification stage, each token is a trigger candidate, because every word has some probability of being a trigger. That is, one sentence yields len(sentence) trigger instances (counted in tokens).
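A quick way to see the instance counts, assuming no candidates are filtered out (the real loader may skip some). The record layout with "tokens" and "entities" keys is hypothetical.

```python
# Hypothetical record layout; entity spans are (start, end) token indices.
sentences = [
    {"tokens": ["The", "tank", "Attacked", "that", "hotel"],
     "entities": [(0, 1), (3, 4)]},
]

# One trigger instance per token.
number_of_trigger_instance = sum(len(s["tokens"]) for s in sentences)

# One argument instance per (token, entity) combination.
number_of_argument_instance = sum(
    len(s["tokens"]) * len(s["entities"]) for s in sentences
)

print(number_of_trigger_instance, number_of_argument_instance)
```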

wzq016 commented on July 29, 2024

tf.zeros is for padding words, since we don't care about the position embedding of padding words.
The other matrix has shape [2*maxlen, constant.posi_embedding_dim]. In this matrix, if a word is at distance dist from the trigger word, its position embedding index is maxlen+dist.

You can refer to this function:

def get_positions(self, start_idx, sent_len, maxlen):
    # Indices before the trigger, the trigger itself (maxlen), indices after it, then 0 for padding.
    return list(range(maxlen-start_idx, maxlen)) + [maxlen] + list(range(maxlen+1, maxlen+sent_len-start_idx)) + [0]*(maxlen-sent_len)
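For anyone who wants to check the indexing by hand, here is a standalone version of the function (without self) run on a small example. The lookup table is a plain Python stand-in for posi_mat with dummy values, not the TensorFlow variable, and the dimension of 3 is arbitrary.

```python
def get_positions(start_idx, sent_len, maxlen):
    # Same logic as the method above, written as a free function.
    return (list(range(maxlen - start_idx, maxlen)) + [maxlen]
            + list(range(maxlen + 1, maxlen + sent_len - start_idx))
            + [0] * (maxlen - sent_len))

maxlen = 6
start_idx = 2   # trigger at token index 2
sent_len = 5    # 5 real tokens, 1 padding slot
positions = get_positions(start_idx, sent_len, maxlen)
print(positions)  # [4, 5, 6, 7, 8, 0]

# Each real token i maps to maxlen + dist, where dist = i - start_idx.
expected = [maxlen + (i - start_idx) for i in range(sent_len)]

# Stand-in for posi_mat: row 0 is the all-zero padding row,
# followed by 2*maxlen rows (dummy values here instead of trained weights).
dim = 3
table = [[0.0] * dim] + [[float(r)] * dim for r in range(1, 2 * maxlen + 1)]
vectors = [table[p] for p in positions]
print(vectors[-1])  # padding slot picks up the zero row: [0.0, 0.0, 0.0]
```

Since start_idx is at most maxlen-1, real tokens never get index 0, so row 0 is only ever used for padding.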

wzq016 commented on July 29, 2024

Hi, if you don't have other questions, I will close this issue :).
