
clab / lstm-parser


Transition-based dependency parser based on stack LSTMs

License: Apache License 2.0

CMake 1.91% C++ 75.09% Cuda 1.13% Makefile 0.08% Python 14.02% Jupyter Notebook 7.74% Shell 0.03%

lstm-parser's People

Contributors

duncanka, miguelballesteros, redpony


lstm-parser's Issues

cnn out of memory #2

Hello,

I am attempting to train the parser with some embeddings I have created. The vocabulary contains 5,393,907 words, and I am running with 100 dimensions. The training oracle file is 110 MB of text, the dev oracle is 17 MB, and the embeddings file is 944 MB. Running the parser without my pretrained embeddings works smoothly.

The computer I am using has 378 GB of RAM and 40 processing units. I am running with --cnn-mem 200000 (200 GB) and still get the cnn out of memory error followed by a core dump. What can I do to run the parser with my embeddings and treebank? I cannot see how 200 GB of RAM can be insufficient. Any help would be greatly appreciated.

Kind regards,
Henrik H. Løvold,
MSc student LTG group at the University of Oslo
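For scale, a back-of-the-envelope sketch of the raw storage the embedding table alone needs (assuming single-precision floats; actual usage by cnn will be higher because of gradients and allocator overhead):

```python
def embedding_bytes(vocab_size, dim, bytes_per_float=4):
    """Raw storage for a vocab_size x dim lookup table of floats."""
    return vocab_size * dim * bytes_per_float

# 5,393,907 words x 100 dims x 4 bytes ~= 2 GiB -- far below 200 GB,
# so the table itself should not be what exhausts the memory pool.
gib = embedding_bytes(5_393_907, 100) / 1024**3
print(f"{gib:.2f} GiB")
```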

Using the lemma of a word in lstm-parse.cc

Hi,

Is there a way to access the lemmas of the words in the treebank (perhaps by extracting them from the CoNLL files) inside lstm-parse.cc? Is lemma information kept anywhere in the code?

Thanks,
Betul
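If the lemma is not retained inside lstm-parse.cc, one workaround is to read it from the original CoNLL file, where the lemma is the third column, and keep it in a side table keyed by token id. A minimal sketch (`read_lemmas` is a hypothetical helper, not part of the parser):

```python
def read_lemmas(conll_lines):
    """Yield (token_id, form, lemma) for each word line of a CoNLL file."""
    for line in conll_lines:
        line = line.rstrip('\n')
        if not line or line.startswith('#'):
            continue                      # skip blanks and comments
        cols = line.split('\t')
        if '-' in cols[0]:
            continue                      # skip multiword-token ranges
        yield cols[0], cols[1], cols[2]   # id, form, lemma
```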

Problem with the '-' character between word and postag in oracle txt files

Hi,

When using the command:

java -jar ParserOracleArcStdWithSwap.jar -t -1 -l 1 -c training.conll > trainingOracle.txt

ParserOracleArcStdWithSwap.jar puts a '-' character between each word and its POS tag in trainingOracle.txt. However, in the current version of the UD treebanks, some treebanks have XPOS values that themselves contain multiple '-' characters, so the oracle files look like this:

[][τὰ-DET_l-p---na-, γὰρ-ADV_d--------, πρὸ-ADP_r--------, αὐτῶν-PRON_p-p---ng-, καὶ-CCONJ_c--------, τὰ-DET_l-p---na-,..., ROOT-ROOT]

When these oracle files are parsed by the load_correct_actions and load_correct_actionsDev methods in c2.h, the words and their POS tags cannot be extracted correctly.

Would it be possible to use another character, such as '#', between words and POS tags when creating the oracle txt files? I tried to replace the '-' character with '#' by decompiling the class files inside ParserOracleArcStdWithSwap.jar, but did not succeed.

Thank you,

Betul
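Until the jar itself can be changed, one workaround is to rewrite the POS columns of the CoNLL file so they contain no '-' before generating the oracle, which keeps the jar's word-POS separator unambiguous. A sketch (`sanitize_pos_line` is a hypothetical preprocessing helper; it assumes the oracle's tag is built from columns 4-5 of a CoNLL-U word line, and '+' is an arbitrary replacement character):

```python
def sanitize_pos_line(conll_line, replacement='+'):
    """Replace '-' inside the UPOS/XPOS columns of one CoNLL-U word line."""
    cols = conll_line.rstrip('\n').split('\t')
    if len(cols) >= 10 and '-' not in cols[0]:        # word lines only
        cols[3] = cols[3].replace('-', replacement)   # UPOS
        cols[4] = cols[4].replace('-', replacement)   # XPOS
    return '\t'.join(cols)
```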

How does lstm-parser handle pre-trained word embeddings?

Hi,

First of all, thank you very much for sharing this excellent dependency parser. I am using it as part of my MSc thesis on improving dependency parsing of Norwegian with pre-trained word embeddings, and I have a question about lstm-parser's handling of pre-trained embeddings that I cannot find the answer to in your article.

From my understanding, there are several different approaches to using pre-trained word embeddings in dependency parsers. UDPipe/Parsito has the option of either using the embeddings directly as feature vectors for each word in the vocabulary, or using the pre-trained embeddings as a basis for the training of internal form-embeddings used by the parser. This raises the question: how does lstm-parser utilise the pre-trained embeddings internally?

Kind regards,
Henrik H. Løvold
LTG group at Uni. of Oslo
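For reference, the stack-LSTM paper (Dyer et al., 2015) describes the pretrained vectors as a second, *fixed* embedding: each word's input representation concatenates a learned embedding, the fixed pretrained embedding, and a POS embedding, then applies a learned affine map followed by a ReLU. A numpy sketch with illustrative (non-default) dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_learn, d_pre, d_pos, d_out = 32, 100, 12, 100

w_learned = rng.normal(size=d_learn)      # updated during training
w_pretrained = rng.normal(size=d_pre)     # loaded with -w, held fixed
t_pos = rng.normal(size=d_pos)            # updated during training
V = rng.normal(size=(d_out, d_learn + d_pre + d_pos))  # learned
b = rng.normal(size=d_out)                # learned

# x = ReLU(V [w; w~; t] + b): the parser's per-word input vector
x = np.maximum(0.0, V @ np.concatenate([w_learned, w_pretrained, t_pos]) + b)
```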

Bad dimensions for AffineTransform exception in char-based branch

Running the training command from README.md on any oracle gives me this error:

Initializing...
Allocating memory...
Done.
COMMAND: ../build/parser/lstm-parse -T oracle.txt -d oracle.txt --hidden_dim 100 --lstm_input_dim 100 --pretrained_dim 100 --rel_dim 20 --action_dim 20 -t -P -S
Unknown word strategy: STOCHASTIC REPLACEMENT
Maximum number of iterations: 8000
Writing parameters to file: parser_pos_2_32_100_20_100_12_20-pid10136.params
done
SHIFT
RIGHT-ARC(R)
LEFT-ARC(E)
LEFT-ARC(R)
LEFT-ARC(D)
LEFT-ARC(A)
LEFT-ARC(U)
RIGHT-ARC(U)
RIGHT-ARC(F)
LEFT-ARC(F)
RIGHT-ARC(L)
RIGHT-ARC(H)
RIGHT-ARC(E)
RIGHT-ARC(A)
RIGHT-ARC(T)
RIGHT-ARC(C)
RIGHT-ARC(N)
RIGHT-ARC(D)
RIGHT-ARC(G)
LEFT-ARC(ROOT)
nactions:20
nwords:272
0:
1:Word
2:Punctuation
Number of words: 272
Number of UTF8 chars: 67
Training started.
NUMBER OF TRAINING SENTENCES: 1
**SHUFFLE
Bad dimensions for AffineTransform: [{50} {50,100} {32,1} {50,50} {50} {50,50} {50}]
Abort

cnn is out of memory

When attempting to train a model, the program aborts with the report

cnn is out of memory, try increasing with --cnn-mem

However, this argument is not accepted by the jar. I don't know how to fix this. Any help would be great!

Cannot open shared object file 'libboost_program_options.so.1.65.1'

Hi,

I am trying to run the program on CentOS 7, where I do not have root privileges. The error occurs when I execute ./parser/lstm-parse in the $HOME/lstm-parser/build directory:

./parser/lstm-parse: error while loading shared libraries: libboost_program_options.so.1.65.1: cannot open shared object file: No such file or directory

And the message during cmake .. -DEIGEN3_INCLUDE_DIR=$HOME/clib/eigen-eigen-5a0156e40feb/ is :

-- Boost version: 1.65.1
-- Found the following Boost libraries:
-- program_options
-- serialization
-- Configuring done
-- Generating done
-- Build files have been written to: $HOME/lstm-parser/build

make -j 20 was also successful (no errors, only warnings about comparisons between signed and unsigned).

I don't know what to do now; any help would be appreciated!

Parser does not output LAS

After training and testing, the parser outputs the unlabeled attachment score (UAS) of the hypothesis, but not the labeled attachment score (LAS). LAS could be computed in much the same way as compute_correct, additionally requiring the predicted dependency label to match the gold one.
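A sketch of how LAS could sit alongside UAS in the spirit of compute_correct: a token counts toward UAS when its predicted head matches the gold head, and toward LAS when the dependency label matches as well (`attachment_scores` is a hypothetical helper, not code from the repo):

```python
def attachment_scores(gold, pred):
    """gold, pred: per-token (head, label) lists. Returns (uas, las)."""
    assert len(gold) == len(pred)
    uas_hits = sum(gh == ph for (gh, _gl), (ph, _pl) in zip(gold, pred))
    las_hits = sum(g == p for g, p in zip(gold, pred))   # head AND label
    n = len(gold)
    return uas_hits / n, las_hits / n
```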

Train arguments are not correct, please fix (for easy-to-use version)

The following command is not correct: there is no --pretrained_dim flag at all in this version, so the option does nothing.
parser/lstm-parse -T trainingOracle.txt -d devOracle.txt --hidden_dim 100 --lstm_input_dim 100 -w sskip.100.vectors --pretrained_dim 100 --rel_dim 20 --action_dim 20 -t -P

For interested users: if you want to train, use the command below instead. The "-r" flag starts training.
build/parser/lstm-parse -t trainingOracle.txt -d devOracle.txt --hidden_dim 100 --lstm_input_dim 100 -w sskip.100.vectors --rel_dim 20 --action_dim 20 -P -r

GPU support and meaning of ppl and llh

Hi,

I ran the code, and it is already really fast, but I wonder whether I can use a GPU to speed up the training process. If that is possible, what should I do?

Also, how do I train the model without the dynamic oracle?

By the way, what do 'llh xxx' and 'ppl xxx' mean?

[epoch=2 eta=0.0862069 clips=68 updates=100] update #325 (epoch 2.59906 |time=Fri Nov 10 19:35:08 2017 CST) llh: 704.437 ppl: 1.24006 err: 0.0610874
**dev (iter=325 epoch=2.59906) llh=0 ppl: 1 err: 1 uas: 0.86321 [2002 sents in 10265.1 ms]
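Reading the log line above: llh is the summed negative log-likelihood of the gold transitions over the reporting window, and ppl is the per-decision perplexity derived from it (this interpretation is consistent with the numbers shown, though the exact window size is internal to the trainer):

```python
import math

def perplexity(neg_log_likelihood, num_decisions):
    """Per-decision perplexity: exp of the average negative log-likelihood."""
    return math.exp(neg_log_likelihood / num_decisions)

# e.g. llh = 704.437 over roughly 3273 transition decisions gives ppl ~= 1.24
```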

Error when parsing multiword expressions in conllu file

Hi,

I am trying to train this parser on Turkish UD Treebank. When I run this command:

java -jar ParserOracleArcStdWithSwap.jar -t -1 -l 1 -c training.conll > trainingOracle.txt

I got the following error:

java.lang.NumberFormatException: For input string: "2-3"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Integer.parseInt(Integer.java:580)
        at java.lang.Integer.parseInt(Integer.java:615)
        at arc_std_swap.Oracle.getTransition(Oracle.java:41)
        at arc_std_swap.Parser.printOracle(Parser.java:366)
        at arc_std_swap.Parser.main(Parser.java:270)

The CoNLL-U sentence on which the oracle generator fails is the one below:

# sent_id = mst-0003
# text = Sanal parçacıklarsa bunların hiçbirini yapamazlar.
1	Sanal	sanal	ADJ	Adj	_	2	amod	_	_
2-3	parçacıklarsa	_	_	_	_	_	_	_	_
2	parçacıklar	parçacık	NOUN	Noun	Case=Nom|Number=Plur|Person=3	6	csubj	_	_
3	sa	i	AUX	Zero	Aspect=Perf|Mood=Cnd|Number=Sing|Person=3|Tense=Pres	2	cop	_	_
4	bunların	bu	PRON	Demons	Case=Gen|Number=Plur|Person=3|PronType=Dem	5	nmod:poss	_	_
5	hiçbirini	hiçbiri	PRON	Quant	Case=Acc|Number=Sing|Number[psor]=Sing|Person=3|Person[psor]=3|PronType=Ind	6	obj	_	_
6	yapamazlar	yap	VERB	Verb	Aspect=Imp|Mood=Pot|Number=Plur|Person=3|Polarity=Neg|Tense=Aor	0	root	_	SpaceAfter=No
7	.	.	PUNCT	Punc	_	6	punct	_	_

The word 'parçacıklarsa' is a multiword token, so it is numbered '2-3'. Does lstm-parser have a mechanism for dealing with multiword tokens? How can I solve this issue?

Thanks,

Betul
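If the oracle generator cannot be modified, one workaround is to strip the multiword-token range lines (IDs like "2-3") from the CoNLL-U file before running the jar, since the dependency tree is defined over the syntactic word lines only (`strip_mwt` is a hypothetical preprocessing helper):

```python
def strip_mwt(conllu_lines):
    """Drop multiword-token range lines (e.g. id '2-3') from CoNLL-U input."""
    for line in conllu_lines:
        token_id = line.split('\t', 1)[0]
        if not line.startswith('#') and '-' in token_id:
            continue                       # skip range lines like "2-3"
        yield line
```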

Serialization error while testing a trained model

Hi,

I have trained a parser model for Turkish. When I attempt to test this model with test data, I get the following error:

lstm-parse: /home/betul/lstm-parser/lstm-parser/cnn/cnn/model.h:92: void cnn::LookupParameters::load(Archive&, unsigned int) [with Archive = boost::archive::text_iarchive]: Assertion `nv == (int)values.size()' failed.
Aborted
Do you have any idea why this error is shown?

Thanks,

Betul

Read zipped word vectors

Currently the -w option specifies a word embedding file in plain text. However, these files can be quite large and are often distributed compressed, for example in .gz format. When given a compressed word embedding file, the program should still read it, decompressing it on the fly.
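A sketch of the desired behavior (the parser does not do this today): detect the gzip magic bytes and decompress transparently, otherwise fall back to plain text.

```python
import gzip
import io

def open_maybe_gzip(path):
    """Open a (possibly gzip-compressed) text file for reading."""
    f = open(path, 'rb')
    magic = f.read(2)
    f.seek(0)
    if magic == b'\x1f\x8b':                  # gzip magic number
        return io.TextIOWrapper(gzip.GzipFile(fileobj=f), encoding='utf-8')
    return io.TextIOWrapper(f, encoding='utf-8')
```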
