clab / lstm-parser
Transition-based dependency parser based on stack LSTMs
License: Apache License 2.0
Hello,
I am attempting to train the parser with some embeddings I have created. The number of words is 5,393,907, and I am using 100 dimensions. The training oracle file is 110 MB of text, and the dev oracle is 17 MB. The embeddings file is 944 MB. Running the parser without my pretrained embeddings works smoothly.
The machine I am using has 378 GB of RAM and 40 processing units. I am running with --cnn-mem 200000 (200 GB) and am still getting the cnn out-of-memory/core-dumped error. What can I do in order to run the parser with my embeddings and treebank? I cannot see how 200 GB of RAM can be insufficient. Any help would be greatly appreciated.
Kind regards,
Henrik H. Løvold,
MSc student LTG group at the University of Oslo
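As a back-of-the-envelope check (my own arithmetic, assuming 4-byte floats and counting no overhead), the raw embedding table alone is only about 2 GB, so the 200 GB limit is unlikely to be exhausted by the vectors themselves; the allocator failure presumably comes from somewhere else.

```cpp
#include <cassert>

// Rough size of a vocab x dim embedding table in GiB
// (assumption: entries are 4-byte floats, no per-parameter overhead).
double embedding_gb(unsigned long long vocab, unsigned dim) {
  return static_cast<double>(vocab) * dim * sizeof(float)
         / (1024.0 * 1024.0 * 1024.0);
}
```

For 5,393,907 words at 100 dimensions this comes out at roughly 2 GiB.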
Hi,
Is there a way to reach the lemmas of words in the treebank (perhaps by extracting them from the CoNLL files) inside lstm-parse.cc? Is lemma information kept anywhere in the code?
Thanks,
Betul
On which corpora did you train sskip.100.vectors? Do you have a generator for creating skip-n-gram vectors?
Hi,
When using the command:
java -jar ParserOracleArcStdWithSwap.jar -t -1 -l 1 -c training.conll > trainingOracle.txt
ParserOracleArcStdWithSwap.jar puts a '-' character between words and their POS tags in the trainingOracle.txt file. However, in the current version of the UD treebanks, some treebanks include xpos values that contain multiple '-' characters, so the oracle files look like this:
[][τὰ-DET_l-p---na-, γὰρ-ADV_d--------, πρὸ-ADP_r--------, αὐτῶν-PRON_p-p---ng-, καὶ-CCONJ_c--------, τὰ-DET_l-p---na-,..., ROOT-ROOT]
When these oracle files are parsed in the load_correct_actions and load_correct_actionsDev methods inside the c2.h file, the words and their POS tags cannot be extracted correctly.
Would it be possible to put another character, such as '#', between words and POS tags when creating the oracle .txt files? I have tried to change the '-' character to '#' by decompiling the class files inside ParserOracleArcStdWithSwap.jar, but could not manage it.
Thank you,
Betul
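One way to make the oracle format robust, along the lines suggested above, is to use a separator character that occurs in neither field (e.g. '#') and split each token on its last occurrence. A minimal sketch of the reading side (my own code, not c2.h's actual logic):

```cpp
#include <cassert>
#include <string>
#include <utility>

// Split a "word<sep>postag" oracle token on the LAST occurrence of the
// separator, so that separators inside the word itself survive. Splitting
// on '-' still breaks when the POS tag contains '-' (as UD xpos values do),
// which is why a character like '#' that appears in neither field is safer.
std::pair<std::string, std::string> split_word_pos(const std::string& tok,
                                                   char sep) {
  std::size_t i = tok.rfind(sep);
  if (i == std::string::npos) return {tok, ""};
  return {tok.substr(0, i), tok.substr(i + 1)};
}
```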
Any plan to implement unlabelled data parsing?
Currently, training has to be terminated manually. There should be an option to make it stop once it reaches a certain dev score, or after a given number of iterations.
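The requested behaviour amounts to a simple predicate checked after each dev evaluation; a sketch with hypothetical parameter names:

```cpp
#include <cassert>

// Stop training once a target dev score is reached or a maximum number of
// iterations has elapsed (both thresholds would come from command-line
// options; the names here are illustrative).
bool should_stop(double dev_uas, unsigned iter,
                 double target_uas, unsigned max_iter) {
  return dev_uas >= target_uas || iter >= max_iter;
}
```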
Hi,
First of all, thank you very much for sharing this excellent dependency parser. I am using your parser as part of my MSc thesis on improving dependency parsing of the Norwegian language using pre-trained word embeddings, and I have a question regarding lstm-parser's handling of pre-trained embeddings which I cannot find the answer to in your article.
From my understanding, there are several different approaches to using pre-trained word embeddings in dependency parsers. UDPipe/Parsito has the option of either using the embeddings directly as feature vectors for each word in the vocabulary, or using the pre-trained embeddings as a basis for the training of internal form-embeddings used by the parser. This raises the question: how does lstm-parser utilise the pre-trained embeddings internally?
Kind regards,
Henrik H. Løvold
LTG group at Uni. of Oslo
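If I recall the stack-LSTM parser paper (Dyer et al., 2015) correctly, it describes a third approach: the pretrained vectors are kept fixed (not fine-tuned), and each token's input concatenates a learned word embedding, the fixed pretrained embedding, and a POS embedding, with a learned linear layer then mapping the concatenation to the LSTM input dimension. A minimal sketch of that construction (names illustrative, not the parser's own; please verify against the code):

```cpp
#include <cassert>
#include <vector>

// Build a token's input by concatenating a learned embedding, a FIXED
// pretrained embedding, and a POS embedding. In the parser this
// concatenation would then pass through a learned rectified linear layer.
std::vector<double> token_input(const std::vector<double>& learned_emb,
                                const std::vector<double>& pretrained_emb,
                                const std::vector<double>& pos_emb) {
  std::vector<double> x;
  x.reserve(learned_emb.size() + pretrained_emb.size() + pos_emb.size());
  x.insert(x.end(), learned_emb.begin(), learned_emb.end());
  x.insert(x.end(), pretrained_emb.begin(), pretrained_emb.end());
  x.insert(x.end(), pos_emb.begin(), pos_emb.end());
  return x;  // then: rectify(V * x + b) maps to the LSTM input dimension
}
```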
Running the training command from README.md on any oracle gives me this error:
Initializing...
Allocating memory...
Done.
COMMAND: ../build/parser/lstm-parse -T oracle.txt -d oracle.txt --hidden_dim 100 --lstm_input_dim 100 --pretrained_dim 100 --rel_dim 20 --action_dim 20 -t -P -S
Unknown word strategy: STOCHASTIC REPLACEMENT
Maximum number of iterations: 8000
Writing parameters to file: parser_pos_2_32_100_20_100_12_20-pid10136.params
done
SHIFT
RIGHT-ARC(R)
LEFT-ARC(E)
LEFT-ARC(R)
LEFT-ARC(D)
LEFT-ARC(A)
LEFT-ARC(U)
RIGHT-ARC(U)
RIGHT-ARC(F)
LEFT-ARC(F)
RIGHT-ARC(L)
RIGHT-ARC(H)
RIGHT-ARC(E)
RIGHT-ARC(A)
RIGHT-ARC(T)
RIGHT-ARC(C)
RIGHT-ARC(N)
RIGHT-ARC(D)
RIGHT-ARC(G)
LEFT-ARC(ROOT)
nactions:20
nwords:272
0:
1:Word
2:Punctuation
Number of words: 272
Number of UTF8 chars: 67
Training started.
NUMBER OF TRAINING SENTENCES: 1
**SHUFFLE
Bad dimensions for AffineTransform: [{50} {50,100} {32,1} {50,50} {50} {50,50} {50}]
Abort
When attempting to train a model, the program aborts with the report
cnn is out of memory, try increasing with --cnn-mem
However, this argument is not accepted by the jar. I don't know how to fix this; any help would be great!
Hi,
I am trying to run the program on CentOS 7, and I do not have root privileges. The following error occurs when I execute ./parser/lstm-parse in the $HOME/lstm-parser/build directory:
./parser/lstm-parse: error while loading shared libraries: libboost_program_options.so.1.65.1: cannot open shared object file: No such file or directory
The output during cmake .. -DEIGEN3_INCLUDE_DIR=$HOME/clib/eigen-eigen-5a0156e40feb/ is:
-- Boost version: 1.65.1
-- Found the following Boost libraries:
-- program_options
-- serialization
-- Configuring done
-- Generating done
-- Build files have been written to: $HOME/lstm-parser/build
make -j 20
was also successful (without any errors, only warnings about comparison between signed and unsigned). I don't know what I should do now; any help would be appreciated!
After training and testing, the parser outputs the unlabeled attachment score (UAS) of the hypothesis, but not the labeled attachment score (LAS). It can be calculated in a way similar to compute_correct.
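By analogy with the unlabeled count, a labeled count only needs to additionally compare the relation label. A minimal sketch (the container types and names here are illustrative, not the parser's actual compute_correct signature):

```cpp
#include <cassert>
#include <map>
#include <string>

// Count tokens whose predicted head AND predicted relation label both
// match the gold annotation; dividing by n gives LAS. compute_correct
// does the same with the head comparison only.
unsigned compute_correct_las(const std::map<int, int>& gold_head,
                             const std::map<int, std::string>& gold_rel,
                             const std::map<int, int>& pred_head,
                             const std::map<int, std::string>& pred_rel,
                             unsigned n) {
  unsigned correct = 0;
  for (unsigned i = 0; i < n; ++i)
    if (gold_head.at(i) == pred_head.at(i) && gold_rel.at(i) == pred_rel.at(i))
      ++correct;
  return correct;
}
```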
Hi,
Which BOOST version was used when compiling the code to create the pre-trained model (in the easy-to-use branch)?
Thanks!
Roy
This argument is not correct. Not only is it incorrect, it also does nothing: there is no "pretrained_dim" flag at all.
parser/lstm-parse -T trainingOracle.txt -d devOracle.txt --hidden_dim 100 --lstm_input_dim 100 -w sskip.100.vectors --pretrained_dim 100 --rel_dim 20 --action_dim 20 -t -P
For interested users: if you want to train, you should use the following command. The "-r" flag starts training.
build/parser/lstm-parse -t trainingOracle.txt -d devOracle.txt --hidden_dim 100 --lstm_input_dim 100 -w sskip.100.vectors --rel_dim 20 --action_dim 20 -P -r
Hi,
I ran the code, and it is already really fast, but I wonder whether I can use a GPU to speed up the training process. If that is possible, what should I do?
Also, how can I train the model without the dynamic oracle?
By the way, what do 'llh xxx' and 'ppl xxx' mean?
[epoch=2 eta=0.0862069 clips=68 updates=100] update #325 (epoch 2.59906 |time=Fri Nov 10 19:35:08 2017 CST) llh: 704.437 ppl: 1.24006 err: 0.0610874
**dev (iter=325 epoch=2.59906) llh=0 ppl: 1 err: 1 uas: 0.86321 [2002 sents in 10265.1 ms]
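For reference, assuming llh is the summed negative log-likelihood over the predicted transitions and ppl its exponentiated per-action average (the usual convention; I have not verified this against the code):

```cpp
#include <cassert>
#include <cmath>

// Perplexity as exp of the average negative log-likelihood per action:
// ppl == 1 means the model is certain of every transition it predicts.
double perplexity(double neg_log_likelihood, unsigned num_actions) {
  return std::exp(neg_log_likelihood / num_actions);
}
```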
Hi,
I am trying to train this parser on Turkish UD Treebank. When I run this command:
java -jar ParserOracleArcStdWithSwap.jar -t -1 -l 1 -c training.conll > trainingOracle.txt
I get the following error:
java.lang.NumberFormatException: For input string: "2-3"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
at arc_std_swap.Oracle.getTransition(Oracle.java:41)
at arc_std_swap.Parser.printOracle(Parser.java:366)
at arc_std_swap.Parser.main(Parser.java:270)
The CoNLL-U sentence on which the lstm parser gives the error is the one below:
# sent_id = mst-0003
# text = Sanal parçacıklarsa bunların hiçbirini yapamazlar.
1 Sanal sanal ADJ Adj _ 2 amod _ _
2-3 parçacıklarsa _ _ _ _ _ _ _ _
2 parçacıklar parçacık NOUN Noun Case=Nom|Number=Plur|Person=3 6 csubj _ _
3 sa i AUX Zero Aspect=Perf|Mood=Cnd|Number=Sing|Person=3|Tense=Pres 2 cop _ _
4 bunların bu PRON Demons Case=Gen|Number=Plur|Person=3|PronType=Dem 5 nmod:poss _ _
5 hiçbirini hiçbiri PRON Quant Case=Acc|Number=Sing|Number[psor]=Sing|Person=3|Person[psor]=3|PronType=Ind 6 obj _ _
6 yapamazlar yap VERB Verb Aspect=Imp|Mood=Pot|Number=Plur|Person=3|Polarity=Neg|Tense=Aor 0 root _ SpaceAfter=No
7 . . PUNCT Punc _ 6 punct _ _
The word 'parçacıklarsa' is a multiword token, so it is numbered '2-3'. Does the lstm parser have a mechanism for dealing with multiword tokens? How can I solve this issue?
Thanks,
Betul
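One workaround, assuming the oracle generator only understands plain integer IDs, is to drop multiword-token range lines (and empty-node lines) from the CoNLL-U file before generating the oracle; the syntactic annotation lives on the individual word lines anyway. A hypothetical filter predicate:

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Keep only ordinary word lines: the first tab-separated field must be a
// plain integer ID. Range IDs like "2-3" (multiword tokens), decimal IDs
// like "2.1" (empty nodes), comments, and blank lines are all rejected.
bool is_plain_token_line(const std::string& line) {
  if (line.empty() || line[0] == '#') return false;
  std::string id = line.substr(0, line.find('\t'));
  if (id.empty()) return false;
  for (char c : id)
    if (!std::isdigit(static_cast<unsigned char>(c))) return false;
  return true;
}
```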
Hi,
I have trained a parser model for Turkish. When I attempt to test this model with test data, I get the following error:
lstm-parse: /home/betul/lstm-parser/lstm-parser/cnn/cnn/model.h:92: void cnn::LookupParameters::load(Archive&, unsigned int) [with Archive = boost::archive::text_iarchive]: Assertion `nv == (int)values.size()' failed.
Aborted
Do you have any idea why this error is shown?
Thanks,
Betul
Currently, the -w option allows specifying a word embedding file as plain text. However, these files can be quite large and are often stored in .gz format, for example. When given a compressed word embedding file, the program should still read it, decompressing it on the fly.
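A minimal sketch of transparent decompression, assuming a gzip binary on PATH and a hypothetical open_embeddings helper (not the parser's actual reader); Boost.Iostreams' gzip_decompressor would be an in-process alternative since Boost is already a dependency:

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// True when the path ends in ".gz".
bool ends_with_gz(const std::string& path) {
  return path.size() >= 3 && path.compare(path.size() - 3, 3, ".gz") == 0;
}

// Open an embeddings file for line-by-line reading, streaming it through
// gzip when compressed so the decompressed file is never materialized.
// Caller must use pclose()/fclose() to match how the handle was opened.
FILE* open_embeddings(const std::string& path) {
  if (ends_with_gz(path)) {
    std::string cmd = "gzip -dc '" + path + "'";
    return popen(cmd.c_str(), "r");
  }
  return fopen(path.c_str(), "r");
}
```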