orenmel / context2vec
License: Apache License 2.0
Hi,
I am trying to train a model on my corpus. So I did:
python context2vec/train/corpus_by_sent_length.py my_file
which went fine, and then:
python context2vec/train/train_context2vec.py -i my_file.DIR -c lstm --deep yes -t 3 --dropout 0.0 -u 300 -e 10 -p 0.75 -b 100 -g -1
(I don't have a GPU-enabled machine). Here's what I get:
GPU: -1
Minibatch-size: 100
Context type: lstm
Deep: True
Dropout: 0.0
Trimfreq: 3
NS Power: 0.75
n_vocab: 62690
corpus size: 95350988
Traceback (most recent call last):
  File "context2vec/train/train_context2vec.py", line 114, in <module>
    model = BiLstmContext(args.deep, args.gpu, reader.word2index, context_word_units, lstm_hidden_units, target_word_units, loss_func, True, args.dropout)
  File "/USER_DIR_NOT_SHOWN/anaconda2/envs/py27base/lib/python2.7/site-packages/context2vec/common/context_models.py", line 76, in __init__
    l2r_embedding=F.EmbedID(n_vocab, in_units)
AttributeError: 'module' object has no attribute 'EmbedID'
Could you please help me understand how to make it work correctly? Thank you!
Is it ok if I add S.load_npz(model_file, model)
after model = BiLstmContext(args.deep, args.gpu, reader.word2index, context_word_units, lstm_hidden_units, target_word_units, loss_func, True, args.dropout)
in train_context2vec.py without using common.model_reader?
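For what it's worth, load_npz simply restores saved arrays into a freshly built model by parameter name, so loading right after constructing BiLstmContext should work as long as the architecture arguments match. The round-trip can be illustrated with plain numpy (a generic illustration with made-up parameter names, not the repo's code):

```python
import numpy as np
import os, tempfile

# Pretend these are trained parameters keyed by name, the way a
# serializer stores a model (the names here are made up).
params = {"l2r_embedding_W": np.arange(15.0).reshape(5, 3),
          "lstm_upward_W": np.ones((4, 3))}

path = os.path.join(tempfile.mkdtemp(), "model.npz")
np.savez(path, **params)

# "Loading" restores each array by name into a freshly built model,
# which is essentially what S.load_npz(model_file, model) does.
data = np.load(path)
restored = {k: data[k] for k in data.files}
assert all(np.array_equal(params[k], restored[k]) for k in params)
```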
Thank you very much.
Hi
In the context of negative sampling, the paper mentions 10 samples, which would suggest 10 independently sampled tokens for every word-context pair. But in the code, where the loss-function calculation is called, it seems the same sample of 10 tokens is used for all B=100 'first word' contexts, then a different sample of 10 tokens is used for the next B=100 'second word' contexts, and so on. Is my understanding correct?
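If that reading is right, the difference between the two schemes is just where the sampling call sits relative to the batch dimension; a small numpy sketch of the batching (illustrative only, not the repo's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
n_vocab, B, k = 1000, 100, 10  # vocab size, minibatch size, negatives per draw

# Per-position sampling: ONE set of k negatives, reused by all B contexts
# at the same sentence position in the minibatch.
shared = rng.integers(0, n_vocab, size=k)
shared_per_context = np.broadcast_to(shared, (B, k))

# Fully independent sampling would instead draw B*k tokens.
independent = rng.integers(0, n_vocab, size=(B, k))

# Every context row sees identical negatives in the shared scheme...
assert (shared_per_context == shared_per_context[0]).all()
# ...while the independent scheme gives each row its own draw.
assert independent.shape == (B, k)
```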
Hi,
During training, how is the number of epochs decided?
As far as I can see, no part of the training data is reserved for evaluation during training. So how can I be sure that the model has converged?
Hi, the link to the pre-trained models (http://u.cs.biu.ac.il/~nlp/resources/downloads/context2vec/) is dead. I would appreciate it if you could check and update it, as I'm eager to use the models for related experiments. Thanks!
Hi,
I am trying to use the package and I am a little confused. My task is to compute similarity scores between sentences.
Can you provide sample code showing how to train and test with it from Python?
Hi, the link for the pre-trained models is dead. Can you please check?
I trained a model with context2vec/train/train_context2vec.py (https://github.com/orenmel/context2vec/blob/master/context2vec/train/train_context2vec.py).
I printed context_v in explore_context2vec.py, and it was all NaN.
When using TensorFlow, I can fix this by adding a small number inside the log of the loss function:
cross_entropy = self.target * tf.log(self.prediction+1e-10)
How to do this in chainer?
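The trick itself is framework-agnostic: log(0) gives -inf, and 0 * -inf gives NaN, so a small epsilon inside the log keeps everything finite; the analogous `+ 1e-10` shift on the prediction should carry over to Chainer's log function as well. A minimal numpy illustration of why the epsilon helps:

```python
import numpy as np

target = np.array([0.0, 1.0])
prediction = np.array([0.0, 1.0])  # a hard zero in the predicted distribution

naive = target * np.log(prediction)           # 0 * log(0) = 0 * -inf = NaN
safe = target * np.log(prediction + 1e-10)    # finite everywhere

assert np.isnan(naive).any()
assert np.isfinite(safe).all()
```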
Hi Oren,
I am training on my 3 GB corpus, on a cluster with a 27 GB memory limit, and I encounter:
cupy.cuda.memory.OutOfMemoryError
Is there a way to limit the memory the code uses? Or to split the corpus file and do the training in steps? Or to change some arguments so that less memory is used?
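One generic way to pursue the splitting idea is to shard the corpus by lines before preprocessing each shard separately (a plain-Python sketch, not part of the repo; whether the trainer can resume across shards is a separate question):

```python
import os, tempfile

def shard_corpus(path, lines_per_shard, out_dir):
    """Split a line-per-sentence corpus into numbered shard files."""
    shards, buf = [], []
    with open(path) as f:
        for line in f:
            buf.append(line)
            if len(buf) == lines_per_shard:
                shards.append(_flush(buf, len(shards), out_dir))
                buf = []
    if buf:
        shards.append(_flush(buf, len(shards), out_dir))
    return shards

def _flush(buf, idx, out_dir):
    out = os.path.join(out_dir, "shard_%03d.txt" % idx)
    with open(out, "w") as f:
        f.writelines(buf)
    return out

# Tiny demo corpus: 10 sentences split into 4-line shards -> 3 files.
tmp = tempfile.mkdtemp()
corpus = os.path.join(tmp, "corpus.txt")
with open(corpus, "w") as f:
    f.write("".join("sentence %d\n" % i for i in range(10)))
shards = shard_corpus(corpus, 4, tmp)
assert len(shards) == 3
```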
Thanks.
@orenmel
It is a little weird: I downloaded the context2vec model trained on ukWaC and tested it on SE-3. Its accuracy is 72.8% with your wsd_main.py code. But if I run your explore_context2vec.py to write the context vectors to a file and compute the similarity myself, the accuracy becomes 73.4%.
How many epochs was that model trained for? Neither of the accuracies above matches the accuracy in your paper.
Thank you very much.
I have an issue using the method context_rep(self, sent, position) of BiLstmContext, because it works only for batch size 1. What would be the best way to get a batch of context vectors with variable context lengths?
Thank you very much for your time.
In explore_context2vec.py, after obtaining context_v and target_v, why is w used again?
Wasn't w already used to compute context_v?
def mult_sim(w, target_v, context_v):
    # Zero out negative similarities before combining multiplicatively.
    target_similarity = w.dot(target_v)
    target_similarity[target_similarity < 0] = 0.0
    context_similarity = w.dot(context_v)
    context_similarity[context_similarity < 0] = 0.0
    return target_similarity * context_similarity
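For reference, mult_sim scores every row of w (one row per vocabulary word) by the product of its non-negatively clipped similarities to the target and context vectors; a tiny self-contained numpy example (the function is repeated here so the snippet runs on its own, with a made-up 3-word, 2-dimensional w):

```python
import numpy as np

def mult_sim(w, target_v, context_v):
    # Clip negative similarities to zero, then combine multiplicatively.
    target_similarity = w.dot(target_v)
    target_similarity[target_similarity < 0] = 0.0
    context_similarity = w.dot(context_v)
    context_similarity[context_similarity < 0] = 0.0
    return target_similarity * context_similarity

w = np.array([[1.0, 0.0],   # word 0: similar to the target only
              [0.0, 1.0],   # word 1: similar to the context only
              [1.0, 1.0]])  # word 2: similar to both
scores = mult_sim(w, np.array([1.0, 0.0]), np.array([0.0, 1.0]))
# Only word 2 survives the multiplicative combination.
assert scores.argmax() == 2 and scores[0] == 0.0 and scores[1] == 0.0
```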
I am doing research on SE-3 WSD, and I want to compare with your method.
So I would like to use the same dev set as yours.
Thank you very much.
Dear creator (@orenmel),
I am currently using your great tool and the pre-trained ukWaC model to create context vectors for target words in Wikipedia sentences.
My questions: 1. Did you use the default ukWaC tokenization as input for training? 2. Which tokenizer should I use on new input at test time to make the best use of the model? (Do you have any suggestions?)
Best regards,
David
Hi,
First of all, sorry: I am not a Chainer user, so I want to ask this directly.
I'm curious whether you trained the 'context word embedding' by initializing a lookup table from scratch, or whether you used pre-trained word2vec vectors without further fine-tuning.
I think it is the former; am I right?
Not really an issue, but do you have any advice for CPU-only training?
https://groups.google.com/forum/#!topic/chainer/vbkOdKaesPI
Hello, I cannot download the pre-trained context2vec model from the following link:
https://u.cs.biu.ac.il/~nlp/resources/downloads/context2vec/
It returns a wp-blog-header.php error message from WordPress.
Could you please help?
I've read the article; it is really clear. Thanks for sharing your work.
But I could not find in the code where the target word is stripped out when you concatenate the left and right contexts. It seems to me that you use all of the context's words to represent the context of word i, including word i itself. I'm probably wrong, since my Python is not very good; could you point out, even with just a short comment, where this is done in the code?
Thanks in advance.
Hi, I'm sorry if this is obvious, but how would I get the embeddings for a context and for a potential target word?
For example, suppose my context is ['river', 'bank'] and my target is ['water']. How would I get a vector for each?
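I can't speak for the repo's exact API, but conceptually, once a context vector (e.g. extracted via explore_context2vec.py) and a target-word embedding are in hand, comparing them is just a normalized dot product; a generic sketch with made-up 3-dimensional vectors standing in for the model's outputs:

```python
import numpy as np

def cosine(a, b):
    # Standard cosine similarity between two dense vectors.
    return float(a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical vectors, NOT real model output.
context_v = np.array([0.2, 0.9, 0.1])    # e.g. context ['river', '[]', 'bank']
target_v = np.array([0.25, 0.85, 0.05])  # e.g. candidate target 'water'

assert 0.99 < cosine(context_v, target_v) <= 1.0
```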