doodlejz / hpsg-neural-parser

Source code for "Head-Driven Phrase Structure Grammar Parsing on Penn Treebank" published at ACL 2019

Home Page: https://arxiv.org/abs/1907.02684

License: MIT License

Languages: Makefile 0.01%, Scilab 0.34%, C 3.33%, Perl 0.01%, Shell 0.19%, Python 96.13%
Topics: nlp, parsing, machine-learning, syntactic-parsing, parser, natural-language-processing, hpsg-neural-parser, bert, xlnet, elmo

hpsg-neural-parser's Introduction

HPSG Neural Parser

This is a Python implementation of the parsers described in "Head-Driven Phrase Structure Grammar Parsing on Penn Treebank" from ACL 2019.

Contents

  1. Requirements
  2. Training
  3. Citation
  4. Credits

Requirements

  • Python 3.6 or higher.
  • Cython 0.25.2 or any compatible version.
  • PyTorch 0.4.0. This code has not been tested with PyTorch 1.0, but it should work.
  • EVALB. Before starting, run make inside the EVALB/ directory to compile an evalb executable; it is called from Python during evaluation (see the setup sketch after this list).
  • AllenNLP 0.7.0 or any compatible version (only required when using ELMo word representations).
  • pytorch-transformers 1.0.0+ or any compatible version (only required when using BERT or XLNet; XLNet is supported only in the joint-span version).
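
A minimal environment-setup sketch based on the list above. The pip pins mirror the versions named in the requirements, but this is an assumption rather than an official install script, so adjust the versions to your setup:

pip install "cython>=0.25.2" torch==0.4.0 allennlp==0.7.0 pytorch-transformers
# Compile the EVALB scorer; the resulting evalb executable is called from Python during evaluation.
cd EVALB
make
cd ..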

Pre-trained Models (PyTorch)

The following pre-trained parser models are available for download:

The pre-trained model with GloVe embeddings achieves a constituency parsing F1 of 93.78, and 96.09 UAS / 94.68 LAS for dependency parsing on the test set.

The pre-trained model with BERT achieves a constituency parsing F1 of 95.84, and 97.00 UAS / 95.43 LAS for dependency parsing on the test set.

The pre-trained model with XLNet achieves a constituency parsing F1 of 96.33, and 97.20 UAS / 95.72 LAS for dependency parsing on the test set.

To use ELMo embeddings, download the following files into the data/ folder (preserving their names):

There is currently no command-line option for configuring the locations/names of the ELMo files.

Pre-trained BERT and XLNet weights will be automatically downloaded as needed by the pytorch-transformers package.

Training

Download the three PTB data files from https://github.com/nikitakit/self-attentive-parser/tree/master/data and put them in the data/ folder. The dependency structures are obtained by converting the constituency trees with version 3.3.0 of the Stanford Parser, run inside the data/ folder:

java -cp stanford-parser_3.3.0.jar edu.stanford.nlp.trees.EnglishGrammaticalStructure -basic -keepPunct -conllx -treeFile 02-21.10way.clean > ptb_train_3.3.0.sd
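
The development and test splits can presumably be converted the same way; the output names below match the defaults of the --dep-dev-ptb-path and --dep-test-ptb-path arguments, but these exact commands are an assumption rather than part of the original instructions:

java -cp stanford-parser_3.3.0.jar edu.stanford.nlp.trees.EnglishGrammaticalStructure -basic -keepPunct -conllx -treeFile 22.auto.clean > ptb_dev_3.3.0.sd
java -cp stanford-parser_3.3.0.jar edu.stanford.nlp.trees.EnglishGrammaticalStructure -basic -keepPunct -conllx -treeFile 23.auto.clean > ptb_test_3.3.0.sd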

For CTB, we use the same datasets and preprocessing as the Distance Parser. For PTB, we use the same datasets and preprocessing as the self-attentive-parser. GloVe embeddings are optional.

Training Instructions

Some of the available arguments are:

| Argument | Description | Default |
| --- | --- | --- |
| --model-path-base | Path base to use for saving models | N/A |
| --evalb-dir | Path to EVALB directory | EVALB/ |
| --train-ptb-path | Path to training constituency parsing data | data/02-21.10way.clean |
| --dev-ptb-path | Path to development constituency parsing data | data/22.auto.clean |
| --dep-train-ptb-path | Path to training dependency parsing data | data/ptb_train_3.3.0.sd |
| --dep-dev-ptb-path | Path to development dependency parsing data | data/ptb_dev_3.3.0.sd |
| --batch-size | Number of examples per training update | 250 |
| --checks-per-epoch | Number of development evaluations per epoch | 4 |
| --subbatch-max-tokens | Maximum number of words to process in parallel while training (a full batch may not fit in GPU memory) | 2000 |
| --eval-batch-size | Number of examples to process in parallel when evaluating on the development set | 30 |
| --numpy-seed | NumPy random seed | Random |
| --use-words | Use learned word embeddings | Do not use word embeddings |
| --use-tags | Use predicted part-of-speech tags as input | Do not use predicted tags |
| --use-chars-lstm | Use learned CharLSTM word representations | Do not use CharLSTM |
| --use-elmo | Use pre-trained ELMo word representations | Do not use ELMo |
| --use-bert | Use pre-trained BERT word representations | Do not use BERT |
| --use-xlnet | Use pre-trained XLNet word representations | Do not use XLNet |
| --pad-left | Pad on the left when using pre-trained XLNet | Do not pad on left |
| --bert-model | Pre-trained BERT model to use if --use-bert is passed | bert-large-uncased |
| --no-bert-do-lower-case | Instructs the BERT tokenizer to retain case information (setting should match the BERT model in use) | Perform lowercasing |
| --xlnet-model | Pre-trained XLNet model to use if --use-xlnet is passed | xlnet-large-cased |
| --no-xlnet-do-lower-case | Instructs the XLNet tokenizer to retain case information (setting should match the XLNet model in use) | Perform lowercasing |
| --const-lada | Lambda weight | 0.5 |
| --model-name | Name of model | test |
| --embedding-path | Path to pre-trained embedding | N/A |
| --embedding-type | Pre-trained embedding type | glove |
| --dataset | Dataset type | ptb |

Additional arguments are available for other hyperparameters; see make_hparams() in src/main.py. These can be specified on the command line, such as --num-layers 2 (for numerical parameters), --use-tags (for boolean parameters that default to False), or --no-partitioned (for boolean parameters that default to True).
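
As a concrete illustration, a training invocation for the single (GloVe/CharLSTM) model assembled from the flags above might look like the following. This is a sketch, not the verbatim contents of run_single.sh; the train subcommand name and the models/joint_single path base are assumptions:

python src/main.py train \
    --dataset ptb \
    --model-path-base models/joint_single \
    --use-words --use-tags --use-chars-lstm \
    --embedding-path data/glove.6B.100d.txt.gz --embedding-type glove \
    --train-ptb-path data/02-21.10way.clean --dev-ptb-path data/22.auto.clean \
    --dep-train-ptb-path data/ptb_train_3.3.0.sd --dep-dev-ptb-path data/ptb_dev_3.3.0.sd \
    --const-lada 0.5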

For each development evaluation, best_dev_score (the sum of the constituency F-score and the dependency LAS on the development set) is compared to the previous best. If the current model is better, the previous checkpoint is deleted and the current model is saved; the new filename is derived from the provided model path base and the development best_dev_score.

As an example, after setting the paths for data and embeddings, to train a Joint-Span parser, simply run:

sh run_single.sh

To train a Joint-Span parser with BERT, simply run:

sh run_bert.sh

To train a Joint-Span parser with XLNet, simply run:

sh run_xlnet.sh

Evaluation Instructions

A saved model can be evaluated on a test corpus using the command python src/main.py test ... with the following arguments:

| Argument | Description | Default |
| --- | --- | --- |
| --model-path-base | Path base of saved model | N/A |
| --evalb-dir | Path to EVALB directory | EVALB/ |
| --test-ptb-path | Path to test constituency parsing data | data/23.auto.clean |
| --dep-test-ptb-path | Path to test dependency parsing data | data/ptb_test_3.3.0.sd |
| --embedding-path | Path to pre-trained embedding | data/glove.6B.100d.txt.gz |
| --eval-batch-size | Number of examples to process in parallel when evaluating on the test set | 100 |
| --dataset | Dataset type | ptb |

As an example, after extracting the pre-trained model, you can evaluate it on the test set using the following command:

sh test.sh
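
test.sh presumably wraps a command of roughly the following shape; the checkpoint filename is a placeholder, so substitute the path of the model you downloaded or trained:

python src/main.py test \
    --dataset ptb \
    --model-path-base models/<your-model>.pt \
    --test-ptb-path data/23.auto.clean \
    --dep-test-ptb-path data/ptb_test_3.3.0.sd \
    --embedding-path data/glove.6B.100d.txt.gz \
    --eval-batch-size 100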

To parse your own sentences, set the input file and the pre-trained model, then run the following command:

sh parse.sh
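
Judging from the repository issues below, the parser expects pre-tokenized input with punctuation separated from words, one sentence per line. A hypothetical way to prepare such a file (input.txt is a placeholder name; point parse.sh at whatever input file you configure):

# Hypothetical input file: one pre-tokenized sentence per line, punctuation separated from words.
printf '%s\n' "no , it was n't black monday ." > input.txt
sh parse.sh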

Citation

If you use this software for research, please cite our paper as follows:

@inproceedings{zhou-zhao-2019-head,
    title = "Head-Driven Phrase Structure Grammar Parsing on {P}enn Treebank",
    author = "Zhou, Junru  and Zhao, Hai",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
}

Credits

The code in this repository and portions of this README are based on https://github.com/nikitakit/self-attentive-parser

hpsg-neural-parser's People

Contributors: doodlejz, msc42

hpsg-neural-parser's Issues

Auto Tags Issue

Hi, I'm a little confused: if 02-21.10way.clean is auto-tagged, how did you measure the constituency parsing score? I checked that the 3rd and 4th columns are the same. If they are auto-tagged, then you don't have ground truth for constituency parsing; otherwise the dependency parsing score is biased.

Punctuation must be separated from words?

Does punctuation have to be separated from words? This is inconvenient.

sentences:
1 no , it was n't black monday .
2 no , it was n't black monday.
3 no, it was n't black monday .
4 no, it was n't black monday.
5 some " circuit breakers " installed after the october 1987 crash failed their first test , traders say , unable to cool the selling panic in both stocks and futures .
6 some "circuit breakers" installed after the october 1987 crash failed their first test, traders say, unable to cool the selling panic in both stocks and futures.

the head:
1 [7, 7, 7, 7, 7, 7, 0, 7]
2 [6, 6, 6, 6, 6, 0, 6]
3 [6, 6, 6, 6, 6, 0, 6]
4 [5, 5, 5, 5, 0, 5]
5 [4, 4, 4, 12, 4, 4, 6, 11, 11, 11, 7, 0, 15, 15, 12, 18, 18, 12, 18, 12, 22, 20, 25, 25, 22, 25, 28, 26, 28, 28, 12]
6 [2, 10, 2, 2, 4, 9, 9, 9, 5, 0, 12, 10, 16, 16, 16, 10, 18, 16, 21, 21, 18, 21, 24, 22, 24, 10]

No function pred_linearize

Loading model from models/joint_bert_dev=95.55_devuas=96.67_devlas=94.86.pt1...
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
packages/torch/nn/_reduction.py:49: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
Parsing sentences...
Traceback (most recent call last):
  File "src_joint/main.py", line 746, in <module>
    main()
  File "src_joint/main.py", line 742, in main
    args.callback(args)
  File "src_joint/main.py", line 680, in run_parse
    save_data(syntree_pred, cun)
  File "src_joint/main.py", line 651, in save_data
    output_file.write("{}\n".format(tree.pred_linearize()))
AttributeError: 'InternalParseNode' object has no attribute 'pred_linearize'

Error in preprocessing data/training CTB

Hi, I ran into an error while training on the CTB dataset.

I downloaded CTB 8.0 from the link and preprocessed the data following step 3 of the guide in the distance-parser GitHub repo, obtaining train/dev/test.txt. I then converted the three txt files to dependency structures with Stanford Parser 3.3.0, obtaining train/dev/test.conll.

Running sh run_single.sh, I got an error:

Traceback (most recent call last):
  File "src_joint/main.py", line 746, in <module>
    main()
  File "src_joint/main.py", line 742, in main
    args.callback(args)
  File "src_joint/main.py", line 688, in <lambda>
    subparser.set_defaults(callback=lambda args: run_train(args, hparams))
  File "src_joint/main.py", line 225, in run_train
    train_parse = [tree.convert() for tree in train_treebank]
  File "src_joint/main.py", line 225, in <listcomp>
    train_parse = [tree.convert() for tree in train_treebank]
  File "/data/lijiachen/HPSG-Neural-Parser/src_joint/trees.py", line 93, in convert
    children.append(child.convert(index = index))
  File "/data/lijiachen/HPSG-Neural-Parser/src_joint/trees.py", line 93, in convert
    children.append(child.convert(index = index))
  File "/data/lijiachen/HPSG-Neural-Parser/src_joint/trees.py", line 93, in convert
    children.append(child.convert(index = index))
  File "/data/lijiachen/HPSG-Neural-Parser/src_joint/trees.py", line 80, in convert
    assert sub_children[-1].right == sub_child.left, str(sub_children[-1].right)+'\t'+str(sub_child.left) #contiune span
AssertionError:

Look forward to your reply.

Runtime Error: The size of tensor a (1100) must match the size of tensor b (695) at non-singleton dimension 0

Hey @DoodleJZ,
I came across this error while running your parser. Could you please look into it and fix it?

Traceback (most recent call last):
  File "/content/drive/My Drive/Hd/DependencyParser/src_joint/main.py", line 746, in <module>
    main()
  File "/content/drive/My Drive/Hd/DependencyParser/src_joint/main.py", line 742, in main
    args.callback(args)
  File "/content/drive/My Drive/Hd/DependencyParser/src_joint/main.py", line 672, in run_parse
    syntree, _ = parser.parse_batch(subbatch_sentences)
  File "/content/drive/My Drive/Hd/DependencyParser/src_joint/Zparser.py", line 1364, in parse_batch
    extra_content_annotations=extra_content_annotations)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/drive/My Drive/Hd/DependencyParser/src_joint/Zparser.py", line 822, in forward
    res, current_attns = attn(res, batch_idxs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/drive/My Drive/Hd/DependencyParser/src_joint/Zparser.py", line 344, in forward
    return self.layer_norm(outputs + residual), attns_padded
RuntimeError: The size of tensor a (1100) must match the size of tensor b (695) at non-singleton dimension 0

Which version of CTB to use?

Hi,

I'm wondering which version of CTB was used in the experiments in the paper.

CTB 5.1 is listed in the paper, but this repo links to distance-parser, which in turn links to CTB 8.0.

Thanks!

UNK in produced parses

I see that the POS predictions (produced by parse.sh) are all UNKs for the constituency output. Can the model generate POS tags?

Using HPSG-Neural-Parser for parsing raw sentence

Hello, I am trying to use your work to parse a raw sentence, but I can't find any sample input files showing the expected format. Should I use CoNLL-U format for the input file, and what linguistic information do I need to provide (e.g. lemma, POS, MWT, ...)?

I have been working on this for a few days now; I have tried passing parse.sh a raw sentence on a single line, an empty dependency tree with POS tags, and CoNLL-U format, all without success.

When I pass the following raw sentence:
Which German cities have more than 250000 inhabitants?
I get the following parse tree:
(SBARQ (WHNP (UNK Which) (UNK German) (UNK cities)) (VP (UNK have) (NP (UNK more) (UNK than) (UNK 250000))) (UNK inhabitants?))

Thanks
Mohamed Eldesouki

Test set accuracy is very low

Loading model from models/joint_bert_dev=95.55_devuas=96.67_devlas=94.86.pt...
Reading dependency parsing data from data/ptb_test_3.3.0.sd
Loading test trees from data/23.auto.clean...
Loaded 2,416 test examples.
Parsing test sentences...
test-fscore (Recall=6.51, Precision=6.70, FScore=6.61) test-elapsed 0h00m34s
best test W. Punct: ucorr: 7738, lcorr: 1426, total: 56684, uas: 13.65%, las: 2.52%, ucm: 1.74%, lcm: 0.75%
best test Wo Punct: ucorr: 7150, lcorr: 1256, total: 49893, uas: 14.33%, las: 2.52%, ucm: 1.82%, lcm: 0.75%
best test Root: corr: 2416, total: 2416, acc: 100.00%

How to run the parser?

Can you provide a sample of the command line to run the parser with the pre-trained models?

Experiment with XLnet?

Firstly, I would like to say that reading your paper was fascinating.
Secondly, I would like to thank you for advancing the state of the art in both constituency parsing and dependency parsing (first place on NLP-progress).

I have not yet read the whole paper, but it seems you used BERT, and BERT is no longer state of the art: it has been surpassed by significant margins by XLNet (https://github.com/zihangdai/xlnet). I think it would be really interesting to train your neural net with XLNet instead of BERT, to see if you can push the state of the art even further!
