orenmel / context2vec
License: Apache License 2.0
Hi,
I am trying to train a model on my corpus. So I did:
python context2vec/train/corpus_by_sent_length.py my_file
which went fine, and then:
python context2vec/train/train_context2vec.py -i my_file.DIR -c lstm --deep yes -t 3 --dropout 0.0 -u 300 -e 10 -p 0.75 -b 100 -g -1
(I don't have a GPU-enabled machine). Here's what I get:
GPU: -1
Minibatch-size: 100
Context type: lstm
Deep: True
Dropout: 0.0
Trimfreq: 3
NS Power: 0.75
n_vocab: 62690
corpus size: 95350988
Traceback (most recent call last):
  File "context2vec/train/train_context2vec.py", line 114, in <module>
    model = BiLstmContext(args.deep, args.gpu, reader.word2index, context_word_units, lstm_hidden_units, target_word_units, loss_func, True, args.dropout)
  File "/USER_DIR_NOT_SHOWN/anaconda2/envs/py27base/lib/python2.7/site-packages/context2vec/common/context_models.py", line 76, in __init__
    l2r_embedding=F.EmbedID(n_vocab, in_units)
AttributeError: 'module' object has no attribute 'EmbedID'
Could you please help me understand how to make it work correctly? Thank you!
Is it ok if I add S.load_npz(model_file, model)
after model = BiLstmContext(args.deep, args.gpu, reader.word2index, context_word_units, lstm_hidden_units, target_word_units, loss_func, True, args.dropout)
in train_context2vec.py without using common.model_reader?
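For what it's worth, load_npz simply restores saved arrays into a freshly built model by parameter name, so loading right after constructing BiLstmContext should work as long as the architecture arguments match. The round-trip can be illustrated with plain numpy (a generic illustration with made-up parameter names, not the repo's code):

```python
import numpy as np
import os, tempfile

# Pretend these are trained parameters keyed by name, the way a
# serializer stores a model (the names here are made up).
params = {"l2r_embedding_W": np.arange(15.0).reshape(5, 3),
          "lstm_upward_W": np.ones((4, 3))}

path = os.path.join(tempfile.mkdtemp(), "model.npz")
np.savez(path, **params)

# "Loading" restores each array by name into a freshly built model,
# which is essentially what S.load_npz(model_file, model) does.
data = np.load(path)
restored = {k: data[k] for k in data.files}
assert all(np.array_equal(params[k], restored[k]) for k in params)
```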
Thank you very much.
Hi
In the context of negative sampling, the paper mentions 10 samples, which would suggest 10 independently sampled tokens for every word-context pair. But in the code, where the loss-function calculation is called, it seems the same sample of 10 tokens is used for all B=100 'first word' contexts, then a different sample of 10 tokens is used for the next B=100 'second word' contexts, and so on. Is my understanding correct?
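If that reading is right, the difference between the two schemes is just where the sampling call sits relative to the batch dimension; a small numpy sketch of the batching (illustrative only, not the repo's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
n_vocab, B, k = 1000, 100, 10  # vocab size, minibatch size, negatives per draw

# Per-position sampling: ONE set of k negatives, reused by all B contexts
# at the same sentence position in the minibatch.
shared = rng.integers(0, n_vocab, size=k)
shared_per_context = np.broadcast_to(shared, (B, k))

# Fully independent sampling would instead draw B*k tokens.
independent = rng.integers(0, n_vocab, size=(B, k))

# Every context row sees identical negatives in the shared scheme...
assert (shared_per_context == shared_per_context[0]).all()
# ...while the independent scheme gives each row its own draw.
assert independent.shape == (B, k)
```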
Hi,
During training, how is the number of epochs decided?
As far as I can see, no part of the training data is reserved for evaluation during training. So how can I be sure that the model has converged?
Hi, the link to the pre-trained models (http://u.cs.biu.ac.il/~nlp/resources/downloads/context2vec/) is dead. I would appreciate it if you could check and update it, as I'm eager to use the models for related experiments. Thanks!
Hi,
I am trying to use the package and I am a little confused. My task is to compute similarity scores between sentences.
Can you provide sample code showing how to train and test with it from Python?
Hi, the link for the pre-trained models is dead. Can you please check?
I trained a model with context2vec/train/train_context2vec.py (https://github.com/orenmel/context2vec/blob/master/context2vec/train/train_context2vec.py).
I printed context_v in explore_context2vec.py, and it was all NaN.
When using TensorFlow, I can fix this by adding a small number inside the log of the loss function:
cross_entropy = self.target * tf.log(self.prediction+1e-10)
How to do this in chainer?
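The trick itself is framework-agnostic: log(0) gives -inf, and 0 * -inf gives NaN, so a small epsilon inside the log keeps everything finite; the analogous `+ 1e-10` shift on the prediction should carry over to Chainer's log function as well. A minimal numpy illustration of why the epsilon helps:

```python
import numpy as np

target = np.array([0.0, 1.0])
prediction = np.array([0.0, 1.0])  # a hard zero in the predicted distribution

naive = target * np.log(prediction)           # 0 * log(0) = 0 * -inf = NaN
safe = target * np.log(prediction + 1e-10)    # finite everywhere

assert np.isnan(naive).any()
assert np.isfinite(safe).all()
```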
Hi Oren,
I am training on my 3 GB corpus, on a cluster with a 27 GB memory limit, and I encounter:
cupy.cuda.memory.OutOfMemoryError
Is there a way to limit the memory the code uses? Or to split the corpus file and do the training in steps? Or to change some arguments so that less memory is used?
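One generic way to pursue the splitting idea is to shard the corpus by lines before preprocessing each shard separately (a plain-Python sketch, not part of the repo; whether the trainer can resume across shards is a separate question):

```python
import os, tempfile

def shard_corpus(path, lines_per_shard, out_dir):
    """Split a line-per-sentence corpus into numbered shard files."""
    shards, buf = [], []
    with open(path) as f:
        for line in f:
            buf.append(line)
            if len(buf) == lines_per_shard:
                shards.append(_flush(buf, len(shards), out_dir))
                buf = []
    if buf:
        shards.append(_flush(buf, len(shards), out_dir))
    return shards

def _flush(buf, idx, out_dir):
    out = os.path.join(out_dir, "shard_%03d.txt" % idx)
    with open(out, "w") as f:
        f.writelines(buf)
    return out

# Tiny demo corpus: 10 sentences split into 4-line shards -> 3 files.
tmp = tempfile.mkdtemp()
corpus = os.path.join(tmp, "corpus.txt")
with open(corpus, "w") as f:
    f.write("".join("sentence %d\n" % i for i in range(10)))
shards = shard_corpus(corpus, 4, tmp)
assert len(shards) == 3
```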
Thanks.
@orenmel
It is a little weird: I downloaded the context2vec model trained on ukWaC and tested it on SE-3. Its accuracy is 72.8% with your wsd_main.py code. But if I run your explore_context2vec.py to write the context vectors to a file and compute the similarity myself, the accuracy becomes 73.4%.
How many epochs was that model trained for? Neither of the accuracies above matches the accuracy in your paper.
Thank you very much.
I have an issue using the method context_rep(self, sent, position) of BiLstmContext, because it works only for batch size 1. What would be the best way to get a batch of context vectors with variable context lengths?
Thank you very much for your time.
In explore_context2vec.py, after obtaining context_v and target_v, why is w used again?
Wasn't w already used to compute context_v?
def mult_sim(w, target_v, context_v):
    # Zero out negative similarities before combining multiplicatively.
    target_similarity = w.dot(target_v)
    target_similarity[target_similarity < 0] = 0.0
    context_similarity = w.dot(context_v)
    context_similarity[context_similarity < 0] = 0.0
    return target_similarity * context_similarity
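For reference, mult_sim scores every row of w (one row per vocabulary word) by the product of its non-negatively clipped similarities to the target and context vectors; a tiny self-contained numpy example (the function is repeated here so the snippet runs on its own, with a made-up 3-word, 2-dimensional w):

```python
import numpy as np

def mult_sim(w, target_v, context_v):
    # Clip negative similarities to zero, then combine multiplicatively.
    target_similarity = w.dot(target_v)
    target_similarity[target_similarity < 0] = 0.0
    context_similarity = w.dot(context_v)
    context_similarity[context_similarity < 0] = 0.0
    return target_similarity * context_similarity

w = np.array([[1.0, 0.0],   # word 0: similar to the target only
              [0.0, 1.0],   # word 1: similar to the context only
              [1.0, 1.0]])  # word 2: similar to both
scores = mult_sim(w, np.array([1.0, 0.0]), np.array([0.0, 1.0]))
# Only word 2 survives the multiplicative combination.
assert scores.argmax() == 2 and scores[0] == 0.0 and scores[1] == 0.0
```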
I am doing research on SE-3 WSD, and I want to compare with your method.
So I would like to use the same dev set as yours.
Thank you very much.
Dear creator (@orenmel),
I am currently using your great tool and the pre-trained ukWaC model to create context vectors for target words in Wikipedia sentences.
My questions: 1. Did you use the default ukWaC tokenization as input for training? 2. Which tokenizer should I use on new input at test time to make the best use of the model? (Do you have any suggestions?)
Best regards,
David
Hi,
First of all, sorry: I am not a Chainer user, so I want to ask this directly.
I'm curious whether you trained the 'context word embedding' by initializing a lookup table from scratch, or whether you used pre-trained word2vec vectors without further fine-tuning.
I think it is the former; am I right?
Not really an issue, but do you have any advice for CPU-only training?
https://groups.google.com/forum/#!topic/chainer/vbkOdKaesPI
Hello, I cannot download the pre-trained context2vec model from the following link:
https://u.cs.biu.ac.il/~nlp/resources/downloads/context2vec/
It returns a wp-blog-header.php error message from WordPress.
Could you please help?
I've read the article; it is really clear. Thanks for sharing your work.
But I could not find in the code where the target word is stripped out when you concatenate the left and right contexts. It seems to me that you use all of the context's words to represent the context of word i, including word i itself. I'm probably wrong, since my Python is not very good; could you point out, even with just a short comment, where this is done in the code?
Thanks in advance.
Hi, I'm sorry if this is obvious, but how would I get the embeddings for a context and for a potential target word?
For example, suppose my context is ['river', 'bank'] and my target is ['water']. How would I get a vector for each?
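I can't speak for the repo's exact API, but conceptually, once a context vector (e.g. extracted via explore_context2vec.py) and a target-word embedding are in hand, comparing them is just a normalized dot product; a generic sketch with made-up 3-dimensional vectors standing in for the model's outputs:

```python
import numpy as np

def cosine(a, b):
    # Standard cosine similarity between two dense vectors.
    return float(a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical vectors, NOT real model output.
context_v = np.array([0.2, 0.9, 0.1])    # e.g. context ['river', '[]', 'bank']
target_v = np.array([0.25, 0.85, 0.05])  # e.g. candidate target 'water'

assert 0.99 < cosine(context_v, target_v) <= 1.0
```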