clab / lstm-parser
Transition-based dependency parser based on stack LSTMs
License: Apache License 2.0
Hello,
I am attempting to train the parser with some embeddings I have created. The number of words is 5,393,907, and I am using 100 dimensions. The training oracle file is 110 MB of text, and the dev oracle is 17 MB. The embeddings file is 944 MB. Running the parser without my pretrained embeddings works smoothly.
The machine I am using has 378 GB of RAM and 40 processing units. I am running with --cnn-mem 200000 (200 GB) and am still getting the cnn out-of-memory/core-dumped error. What can I do in order to run the parser with my embeddings and treebank? I cannot see how 200 GB of RAM can be insufficient. Any help would be greatly appreciated.
Kind regards,
Henrik H. Løvold,
MSc student LTG group at the University of Oslo
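As a back-of-the-envelope check (my own arithmetic, assuming 4-byte floats and counting no overhead), the raw embedding table alone is only about 2 GB, so the 200 GB limit is unlikely to be exhausted by the vectors themselves; the allocator failure presumably comes from somewhere else.

```cpp
#include <cassert>

// Rough size of a vocab x dim embedding table in GiB
// (assumption: entries are 4-byte floats, no per-parameter overhead).
double embedding_gb(unsigned long long vocab, unsigned dim) {
  return static_cast<double>(vocab) * dim * sizeof(float)
         / (1024.0 * 1024.0 * 1024.0);
}
```

For 5,393,907 words at 100 dimensions this comes out at roughly 2 GiB.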
Hi,
Is there a way to reach the lemmas of words in the treebank (perhaps by extracting them from the CoNLL files) inside lstm-parse.cc? Is lemma information kept anywhere in the code?
Thanks,
Betul
On which corpora did you train sskip.100.vectors? Do you have a generator for creating skip-n-gram vectors?
Hi,
When using the command:
java -jar ParserOracleArcStdWithSwap.jar -t -1 -l 1 -c training.conll > trainingOracle.txt
ParserOracleArcStdWithSwap.jar puts a '-' character between words and their POS tags in the trainingOracle.txt file. However, in the current version of the UD treebanks, some treebanks include xpos values that contain multiple '-' characters, so the oracle files look like this:
[][τὰ-DET_l-p---na-, γὰρ-ADV_d--------, πρὸ-ADP_r--------, αὐτῶν-PRON_p-p---ng-, καὶ-CCONJ_c--------, τὰ-DET_l-p---na-,..., ROOT-ROOT]
When these oracle files are parsed in the load_correct_actions and load_correct_actionsDev methods inside the c2.h file, the words and their POS tags cannot be extracted correctly.
Would it be possible to put another character, such as '#', between words and POS tags when creating the oracle .txt files? I have tried to change the '-' character to '#' by decompiling the class files inside ParserOracleArcStdWithSwap.jar, but could not manage it.
Thank you,
Betul
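One way to make the oracle format robust, along the lines suggested above, is to use a separator character that occurs in neither field (e.g. '#') and split each token on its last occurrence. A minimal sketch of the reading side (my own code, not c2.h's actual logic):

```cpp
#include <cassert>
#include <string>
#include <utility>

// Split a "word<sep>postag" oracle token on the LAST occurrence of the
// separator, so that separators inside the word itself survive. Splitting
// on '-' still breaks when the POS tag contains '-' (as UD xpos values do),
// which is why a character like '#' that appears in neither field is safer.
std::pair<std::string, std::string> split_word_pos(const std::string& tok,
                                                   char sep) {
  std::size_t i = tok.rfind(sep);
  if (i == std::string::npos) return {tok, ""};
  return {tok.substr(0, i), tok.substr(i + 1)};
}
```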
Any plan to implement unlabelled data parsing?
Currently, training has to be terminated manually. There should be an option to make it stop once it reaches a certain dev score, or after a given number of iterations.
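The requested behaviour amounts to a simple predicate checked after each dev evaluation; a sketch with hypothetical parameter names:

```cpp
#include <cassert>

// Stop training once a target dev score is reached or a maximum number of
// iterations has elapsed (both thresholds would come from command-line
// options; the names here are illustrative).
bool should_stop(double dev_uas, unsigned iter,
                 double target_uas, unsigned max_iter) {
  return dev_uas >= target_uas || iter >= max_iter;
}
```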
Hi,
First of all, thank you very much for sharing this excellent dependency parser. I am using your parser as part of my MSc thesis on improving dependency parsing of the Norwegian language using pre-trained word embeddings, and I have a question regarding lstm-parser's handling of pre-trained embeddings which I cannot find the answer to in your article.
From my understanding, there are several different approaches to using pre-trained word embeddings in dependency parsers. UDPipe/Parsito has the option of either using the embeddings directly as feature vectors for each word in the vocabulary, or using the pre-trained embeddings as a basis for the training of internal form-embeddings used by the parser. This raises the question: how does lstm-parser utilise the pre-trained embeddings internally?
Kind regards,
Henrik H. Løvold
LTG group at Uni. of Oslo
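If I recall the stack-LSTM parser paper (Dyer et al., 2015) correctly, it describes a third approach: the pretrained vectors are kept fixed (not fine-tuned), and each token's input concatenates a learned word embedding, the fixed pretrained embedding, and a POS embedding, with a learned linear layer then mapping the concatenation to the LSTM input dimension. A minimal sketch of that construction (names illustrative, not the parser's own; please verify against the code):

```cpp
#include <cassert>
#include <vector>

// Build a token's input by concatenating a learned embedding, a FIXED
// pretrained embedding, and a POS embedding. In the parser this
// concatenation would then pass through a learned rectified linear layer.
std::vector<double> token_input(const std::vector<double>& learned_emb,
                                const std::vector<double>& pretrained_emb,
                                const std::vector<double>& pos_emb) {
  std::vector<double> x;
  x.reserve(learned_emb.size() + pretrained_emb.size() + pos_emb.size());
  x.insert(x.end(), learned_emb.begin(), learned_emb.end());
  x.insert(x.end(), pretrained_emb.begin(), pretrained_emb.end());
  x.insert(x.end(), pos_emb.begin(), pos_emb.end());
  return x;  // then: rectify(V * x + b) maps to the LSTM input dimension
}
```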
Running the training command from README.md on any oracle gives me this error:
Initializing...
Allocating memory...
Done.
COMMAND: ../build/parser/lstm-parse -T oracle.txt -d oracle.txt --hidden_dim 100 --lstm_input_dim 100 --pretrained_dim 100 --rel_dim 20 --action_dim 20 -t -P -S
Unknown word strategy: STOCHASTIC REPLACEMENT
Maximum number of iterations: 8000
Writing parameters to file: parser_pos_2_32_100_20_100_12_20-pid10136.params
done
SHIFT
RIGHT-ARC(R)
LEFT-ARC(E)
LEFT-ARC(R)
LEFT-ARC(D)
LEFT-ARC(A)
LEFT-ARC(U)
RIGHT-ARC(U)
RIGHT-ARC(F)
LEFT-ARC(F)
RIGHT-ARC(L)
RIGHT-ARC(H)
RIGHT-ARC(E)
RIGHT-ARC(A)
RIGHT-ARC(T)
RIGHT-ARC(C)
RIGHT-ARC(N)
RIGHT-ARC(D)
RIGHT-ARC(G)
LEFT-ARC(ROOT)
nactions:20
nwords:272
0:
1:Word
2:Punctuation
Number of words: 272
Number of UTF8 chars: 67
Training started.
NUMBER OF TRAINING SENTENCES: 1
**SHUFFLE
Bad dimensions for AffineTransform: [{50} {50,100} {32,1} {50,50} {50} {50,50} {50}]
Abort
When attempting to train a model, the program aborts with the report
cnn is out of memory, try increasing with --cnn-mem
However, this argument is not accepted by the jar. I don't know how to fix this; any help would be great!
Hi,
I am trying to run the program on CentOS 7, and I do not have root privileges. The following error occurs when I execute ./parser/lstm-parse in the $HOME/lstm-parser/build directory:
./parser/lstm-parse: error while loading shared libraries: libboost_program_options.so.1.65.1: cannot open shared object file: No such file or directory
The output during cmake .. -DEIGEN3_INCLUDE_DIR=$HOME/clib/eigen-eigen-5a0156e40feb/ is:
-- Boost version: 1.65.1
-- Found the following Boost libraries:
-- program_options
-- serialization
-- Configuring done
-- Generating done
-- Build files have been written to: $HOME/lstm-parser/build
make -j 20
was also successful (without any errors, only warnings about comparison between signed and unsigned). I don't know what I should do now; any help would be appreciated!
After training and testing, the parser outputs the unlabeled attachment score (UAS) of the hypothesis, but not the labeled attachment score (LAS). It can be calculated in a way similar to compute_correct.
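By analogy with the unlabeled count, a labeled count only needs to additionally compare the relation label. A minimal sketch (the container types and names here are illustrative, not the parser's actual compute_correct signature):

```cpp
#include <cassert>
#include <map>
#include <string>

// Count tokens whose predicted head AND predicted relation label both
// match the gold annotation; dividing by n gives LAS. compute_correct
// does the same with the head comparison only.
unsigned compute_correct_las(const std::map<int, int>& gold_head,
                             const std::map<int, std::string>& gold_rel,
                             const std::map<int, int>& pred_head,
                             const std::map<int, std::string>& pred_rel,
                             unsigned n) {
  unsigned correct = 0;
  for (unsigned i = 0; i < n; ++i)
    if (gold_head.at(i) == pred_head.at(i) && gold_rel.at(i) == pred_rel.at(i))
      ++correct;
  return correct;
}
```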
Hi,
Which BOOST version was used when compiling the code to create the pre-trained model (in the easy-to-use branch)?
Thanks!
Roy
This argument is not correct. Not only is it incorrect, it also does nothing: there is no "pretrained_dim" flag at all.
parser/lstm-parse -T trainingOracle.txt -d devOracle.txt --hidden_dim 100 --lstm_input_dim 100 -w sskip.100.vectors --pretrained_dim 100 --rel_dim 20 --action_dim 20 -t -P
For interested users: if you want to train, you should use the following command. The "-r" flag starts training.
build/parser/lstm-parse -t trainingOracle.txt -d devOracle.txt --hidden_dim 100 --lstm_input_dim 100 -w sskip.100.vectors --rel_dim 20 --action_dim 20 -P -r
Hi,
I ran the code, and it is already really fast, but I wonder whether I can use a GPU to speed up the training process. If that is possible, what should I do?
Also, how can I train the model without the dynamic oracle?
By the way, what do 'llh xxx' and 'ppl xxx' mean?
[epoch=2 eta=0.0862069 clips=68 updates=100] update #325 (epoch 2.59906 |time=Fri Nov 10 19:35:08 2017 CST) llh: 704.437 ppl: 1.24006 err: 0.0610874
**dev (iter=325 epoch=2.59906) llh=0 ppl: 1 err: 1 uas: 0.86321 [2002 sents in 10265.1 ms]
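For reference, assuming llh is the summed negative log-likelihood over the predicted transitions and ppl its exponentiated per-action average (the usual convention; I have not verified this against the code):

```cpp
#include <cassert>
#include <cmath>

// Perplexity as exp of the average negative log-likelihood per action:
// ppl == 1 means the model is certain of every transition it predicts.
double perplexity(double neg_log_likelihood, unsigned num_actions) {
  return std::exp(neg_log_likelihood / num_actions);
}
```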
Hi,
I am trying to train this parser on Turkish UD Treebank. When I run this command:
java -jar ParserOracleArcStdWithSwap.jar -t -1 -l 1 -c training.conll > trainingOracle.txt
I get the following error:
java.lang.NumberFormatException: For input string: "2-3"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
at arc_std_swap.Oracle.getTransition(Oracle.java:41)
at arc_std_swap.Parser.printOracle(Parser.java:366)
at arc_std_swap.Parser.main(Parser.java:270)
The CoNLL-U sentence on which the lstm parser gives the error is the one below:
# sent_id = mst-0003
# text = Sanal parçacıklarsa bunların hiçbirini yapamazlar.
1 Sanal sanal ADJ Adj _ 2 amod _ _
2-3 parçacıklarsa _ _ _ _ _ _ _ _
2 parçacıklar parçacık NOUN Noun Case=Nom|Number=Plur|Person=3 6 csubj _ _
3 sa i AUX Zero Aspect=Perf|Mood=Cnd|Number=Sing|Person=3|Tense=Pres 2 cop _ _
4 bunların bu PRON Demons Case=Gen|Number=Plur|Person=3|PronType=Dem 5 nmod:poss _ _
5 hiçbirini hiçbiri PRON Quant Case=Acc|Number=Sing|Number[psor]=Sing|Person=3|Person[psor]=3|PronType=Ind 6 obj _ _
6 yapamazlar yap VERB Verb Aspect=Imp|Mood=Pot|Number=Plur|Person=3|Polarity=Neg|Tense=Aor 0 root _ SpaceAfter=No
7 . . PUNCT Punc _ 6 punct _ _
The word 'parçacıklarsa' is a multiword token, so it is numbered '2-3'. Does the lstm parser have a mechanism for dealing with multiword tokens? How can I solve this issue?
Thanks,
Betul
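One workaround, assuming the oracle generator only understands plain integer IDs, is to drop multiword-token range lines (and empty-node lines) from the CoNLL-U file before generating the oracle; the syntactic annotation lives on the individual word lines anyway. A hypothetical filter predicate:

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Keep only ordinary word lines: the first tab-separated field must be a
// plain integer ID. Range IDs like "2-3" (multiword tokens), decimal IDs
// like "2.1" (empty nodes), comments, and blank lines are all rejected.
bool is_plain_token_line(const std::string& line) {
  if (line.empty() || line[0] == '#') return false;
  std::string id = line.substr(0, line.find('\t'));
  if (id.empty()) return false;
  for (char c : id)
    if (!std::isdigit(static_cast<unsigned char>(c))) return false;
  return true;
}
```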
Hi,
I have trained a parser model for Turkish. When I attempt to test this model with test data, I get the following error:
lstm-parse: /home/betul/lstm-parser/lstm-parser/cnn/cnn/model.h:92: void cnn::LookupParameters::load(Archive&, unsigned int) [with Archive = boost::archive::text_iarchive]: Assertion `nv == (int)values.size()' failed.
Aborted
Do you have any idea why this error is shown?
Thanks,
Betul
Currently, the -w option allows specifying a word embedding file as plain text. However, these files can be quite large and are often stored in .gz format, for example. When given a compressed word embedding file, the program should still read it, decompressing it on the fly.
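A minimal sketch of transparent decompression, assuming a gzip binary on PATH and a hypothetical open_embeddings helper (not the parser's actual reader); Boost.Iostreams' gzip_decompressor would be an in-process alternative since Boost is already a dependency:

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// True when the path ends in ".gz".
bool ends_with_gz(const std::string& path) {
  return path.size() >= 3 && path.compare(path.size() - 3, 3, ".gz") == 0;
}

// Open an embeddings file for line-by-line reading, streaming it through
// gzip when compressed so the decompressed file is never materialized.
// Caller must use pclose()/fclose() to match how the handle was opened.
FILE* open_embeddings(const std::string& path) {
  if (ends_with_gz(path)) {
    std::string cmd = "gzip -dc '" + path + "'";
    return popen(cmd.c_str(), "r");
  }
  return fopen(path.c_str(), "r");
}
```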