doodlejz / hpsg-neural-parser

Source code for "Head-Driven Phrase Structure Grammar Parsing on Penn Treebank" published at ACL 2019

Home Page: https://arxiv.org/abs/1907.02684

License: MIT License

Languages: Makefile 0.01%, Scilab 0.34%, C 3.33%, Perl 0.01%, Shell 0.19%, Python 96.13%
Topics: nlp, parsing, machine-learning, syntactic-parsing, parser, natural-language-processing, hpsg-neural-parser, bert, xlnet, elmo

hpsg-neural-parser's Introduction

HPSG Neural Parser

This is a Python implementation of the parsers described in "Head-Driven Phrase Structure Grammar Parsing on Penn Treebank" from ACL 2019.

Contents

  1. Requirements
  2. Training
  3. Citation
  4. Credits

Requirements

  • Python 3.6 or higher.
  • Cython 0.25.2 or any compatible version.
  • PyTorch 0.4.0. This code has not been tested with PyTorch 1.0, but it should work.
  • EVALB. Before starting, run make inside the EVALB/ directory to compile an evalb executable; it is called from Python during evaluation (see the setup sketch after this list).
  • AllenNLP 0.7.0 or any compatible version (only required when using ELMo word representations).
  • pytorch-transformers 1.0.0+ or any compatible version (only required when using BERT or XLNet; XLNet is supported only in the joint-span version).
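
A minimal environment-setup sketch based on the list above. The pip pins mirror the versions named in the requirements, but this is an assumption rather than an official install script, so adjust the versions to your setup:

pip install "cython>=0.25.2" torch==0.4.0 allennlp==0.7.0 pytorch-transformers
# Compile the EVALB scorer; the resulting evalb executable is called from Python during evaluation.
cd EVALB
make
cd ..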

Pre-trained Models (PyTorch)

The following pre-trained parser models are available for download:

The pre-trained model with GloVe embeddings achieves a constituency parsing F1 of 93.78, and 96.09 UAS / 94.68 LAS for dependency parsing on the test set.

The pre-trained model with BERT achieves a constituency parsing F1 of 95.84, and 97.00 UAS / 95.43 LAS for dependency parsing on the test set.

The pre-trained model with XLNet achieves a constituency parsing F1 of 96.33, and 97.20 UAS / 95.72 LAS for dependency parsing on the test set.

To use ELMo embeddings, download the following files into the data/ folder (preserving their names):

There is currently no command-line option for configuring the locations/names of the ELMo files.

Pre-trained BERT and XLNet weights will be automatically downloaded as needed by the pytorch-transformers package.

Training

Download the three PTB data files from https://github.com/nikitakit/self-attentive-parser/tree/master/data and put them in the data/ folder. The dependency structures are obtained by converting the constituency trees with version 3.3.0 of the Stanford Parser, run inside the data/ folder:

java -cp stanford-parser_3.3.0.jar edu.stanford.nlp.trees.EnglishGrammaticalStructure -basic -keepPunct -conllx -treeFile 02-21.10way.clean > ptb_train_3.3.0.sd
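
The development and test splits can presumably be converted the same way; the output names below match the defaults of the --dep-dev-ptb-path and --dep-test-ptb-path arguments, but these exact commands are an assumption rather than part of the original instructions:

java -cp stanford-parser_3.3.0.jar edu.stanford.nlp.trees.EnglishGrammaticalStructure -basic -keepPunct -conllx -treeFile 22.auto.clean > ptb_dev_3.3.0.sd
java -cp stanford-parser_3.3.0.jar edu.stanford.nlp.trees.EnglishGrammaticalStructure -basic -keepPunct -conllx -treeFile 23.auto.clean > ptb_test_3.3.0.sd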

For CTB, we use the same datasets and preprocessing as the Distance Parser. For PTB, we use the same datasets and preprocessing as the self-attentive-parser. GloVe embeddings are optional.

Training Instructions

Some of the available arguments are:

| Argument | Description | Default |
| --- | --- | --- |
| --model-path-base | Path base to use for saving models | N/A |
| --evalb-dir | Path to EVALB directory | EVALB/ |
| --train-ptb-path | Path to training constituency parsing data | data/02-21.10way.clean |
| --dev-ptb-path | Path to development constituency parsing data | data/22.auto.clean |
| --dep-train-ptb-path | Path to training dependency parsing data | data/ptb_train_3.3.0.sd |
| --dep-dev-ptb-path | Path to development dependency parsing data | data/ptb_dev_3.3.0.sd |
| --batch-size | Number of examples per training update | 250 |
| --checks-per-epoch | Number of development evaluations per epoch | 4 |
| --subbatch-max-tokens | Maximum number of words to process in parallel while training (a full batch may not fit in GPU memory) | 2000 |
| --eval-batch-size | Number of examples to process in parallel when evaluating on the development set | 30 |
| --numpy-seed | NumPy random seed | Random |
| --use-words | Use learned word embeddings | Do not use word embeddings |
| --use-tags | Use predicted part-of-speech tags as input | Do not use predicted tags |
| --use-chars-lstm | Use learned CharLSTM word representations | Do not use CharLSTM |
| --use-elmo | Use pre-trained ELMo word representations | Do not use ELMo |
| --use-bert | Use pre-trained BERT word representations | Do not use BERT |
| --use-xlnet | Use pre-trained XLNet word representations | Do not use XLNet |
| --pad-left | Pad on the left when using pre-trained XLNet | Do not pad on left |
| --bert-model | Pre-trained BERT model to use if --use-bert is passed | bert-large-uncased |
| --no-bert-do-lower-case | Instructs the BERT tokenizer to retain case information (setting should match the BERT model in use) | Perform lowercasing |
| --xlnet-model | Pre-trained XLNet model to use if --use-xlnet is passed | xlnet-large-cased |
| --no-xlnet-do-lower-case | Instructs the XLNet tokenizer to retain case information (setting should match the XLNet model in use) | Perform lowercasing |
| --const-lada | Lambda weight | 0.5 |
| --model-name | Name of model | test |
| --embedding-path | Path to pre-trained embedding | N/A |
| --embedding-type | Pre-trained embedding type | glove |
| --dataset | Dataset type | ptb |

Additional arguments are available for other hyperparameters; see make_hparams() in src/main.py. These can be specified on the command line, such as --num-layers 2 (for numerical parameters), --use-tags (for boolean parameters that default to False), or --no-partitioned (for boolean parameters that default to True).
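
As a concrete illustration, a training invocation for the single (GloVe/CharLSTM) model assembled from the flags above might look like the following. This is a sketch, not the verbatim contents of run_single.sh; the train subcommand name and the models/joint_single path base are assumptions:

python src/main.py train \
    --dataset ptb \
    --model-path-base models/joint_single \
    --use-words --use-tags --use-chars-lstm \
    --embedding-path data/glove.6B.100d.txt.gz --embedding-type glove \
    --train-ptb-path data/02-21.10way.clean --dev-ptb-path data/22.auto.clean \
    --dep-train-ptb-path data/ptb_train_3.3.0.sd --dep-dev-ptb-path data/ptb_dev_3.3.0.sd \
    --const-lada 0.5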

For each development evaluation, best_dev_score (the sum of the constituency F-score and the dependency LAS on the development set) is compared to the previous best. If the current model is better, the previous checkpoint is deleted and the current model is saved; the new filename is derived from the provided model path base and the development best_dev_score.

As an example, after setting the paths for data and embeddings, to train a Joint-Span parser, simply run:

sh run_single.sh

To train a Joint-Span parser with BERT, simply run:

sh run_bert.sh

To train a Joint-Span parser with XLNet, simply run:

sh run_xlnet.sh

Evaluation Instructions

A saved model can be evaluated on a test corpus using the command python src/main.py test ... with the following arguments:

| Argument | Description | Default |
| --- | --- | --- |
| --model-path-base | Path base of saved model | N/A |
| --evalb-dir | Path to EVALB directory | EVALB/ |
| --test-ptb-path | Path to test constituency parsing data | data/23.auto.clean |
| --dep-test-ptb-path | Path to test dependency parsing data | data/ptb_test_3.3.0.sd |
| --embedding-path | Path to pre-trained embedding | data/glove.6B.100d.txt.gz |
| --eval-batch-size | Number of examples to process in parallel when evaluating on the test set | 100 |
| --dataset | Dataset type | ptb |

As an example, after extracting the pre-trained model, you can evaluate it on the test set using the following command:

sh test.sh
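
test.sh presumably wraps a command of roughly the following shape; the checkpoint filename is a placeholder, so substitute the path of the model you downloaded or trained:

python src/main.py test \
    --dataset ptb \
    --model-path-base models/<your-model>.pt \
    --test-ptb-path data/23.auto.clean \
    --dep-test-ptb-path data/ptb_test_3.3.0.sd \
    --embedding-path data/glove.6B.100d.txt.gz \
    --eval-batch-size 100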

To parse your own sentences, set the input file and the pre-trained model, then run the following command:

sh parse.sh
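
Judging from the repository issues below, the parser expects pre-tokenized input with punctuation separated from words, one sentence per line. A hypothetical way to prepare such a file (input.txt is a placeholder name; point parse.sh at whatever input file you configure):

# Hypothetical input file: one pre-tokenized sentence per line, punctuation separated from words.
printf '%s\n' "no , it was n't black monday ." > input.txt
sh parse.sh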

Citation

If you use this software for research, please cite our paper as follows:

@inproceedings{zhou-zhao-2019-head,
    title = "Head-Driven Phrase Structure Grammar Parsing on {P}enn Treebank",
    author = "Zhou, Junru  and Zhao, Hai",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
}

Credits

The code in this repository and portions of this README are based on https://github.com/nikitakit/self-attentive-parser

hpsg-neural-parser's People

Contributors: doodlejz, msc42

hpsg-neural-parser's Issues

Auto Tags Issue

Hi, I'm a little confused: if 02-21.10way.clean is auto-tagged, how did you measure the constituency parsing score? I checked that the 3rd and 4th columns are the same. If they are auto-tagged, then you don't have ground truth for constituency parsing; otherwise the dependency parsing score is biased.

Punctuation must be separated from words?

Does punctuation have to be separated from words? This is inconvenient.

sentences:
1 no , it was n't black monday .
2 no , it was n't black monday.
3 no, it was n't black monday .
4 no, it was n't black monday.
5 some " circuit breakers " installed after the october 1987 crash failed their first test , traders say , unable to cool the selling panic in both stocks and futures .
6 some "circuit breakers" installed after the october 1987 crash failed their first test, traders say, unable to cool the selling panic in both stocks and futures.

the head:
1 [7, 7, 7, 7, 7, 7, 0, 7]
2 [6, 6, 6, 6, 6, 0, 6]
3 [6, 6, 6, 6, 6, 0, 6]
4 [5, 5, 5, 5, 0, 5]
5 [4, 4, 4, 12, 4, 4, 6, 11, 11, 11, 7, 0, 15, 15, 12, 18, 18, 12, 18, 12, 22, 20, 25, 25, 22, 25, 28, 26, 28, 28, 12]
6 [2, 10, 2, 2, 4, 9, 9, 9, 5, 0, 12, 10, 16, 16, 16, 10, 18, 16, 21, 21, 18, 21, 24, 22, 24, 10]

No function pred_linearize

Loading model from models/joint_bert_dev=95.55_devuas=96.67_devlas=94.86.pt1...
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
packages/torch/nn/_reduction.py:49: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
Parsing sentences...
Traceback (most recent call last):
  File "src_joint/main.py", line 746, in <module>
    main()
  File "src_joint/main.py", line 742, in main
    args.callback(args)
  File "src_joint/main.py", line 680, in run_parse
    save_data(syntree_pred, cun)
  File "src_joint/main.py", line 651, in save_data
    output_file.write("{}\n".format(tree.pred_linearize()))
AttributeError: 'InternalParseNode' object has no attribute 'pred_linearize'

Error in preprocessing data/training CTB

Hi, I ran into an error while training on the CTB dataset.

I downloaded CTB 8.0 from the link and preprocessed the data following step 3 of the guide in the distance-parser GitHub repo, obtaining train/dev/test.txt. I then converted the three txt files to dependency structures with Stanford Parser 3.3.0, obtaining train/dev/test.conll.

Running sh run_single.sh, I got an error:

Traceback (most recent call last):
  File "src_joint/main.py", line 746, in <module>
    main()
  File "src_joint/main.py", line 742, in main
    args.callback(args)
  File "src_joint/main.py", line 688, in <lambda>
    subparser.set_defaults(callback=lambda args: run_train(args, hparams))
  File "src_joint/main.py", line 225, in run_train
    train_parse = [tree.convert() for tree in train_treebank]
  File "src_joint/main.py", line 225, in <listcomp>
    train_parse = [tree.convert() for tree in train_treebank]
  File "/data/lijiachen/HPSG-Neural-Parser/src_joint/trees.py", line 93, in convert
    children.append(child.convert(index = index))
  File "/data/lijiachen/HPSG-Neural-Parser/src_joint/trees.py", line 93, in convert
    children.append(child.convert(index = index))
  File "/data/lijiachen/HPSG-Neural-Parser/src_joint/trees.py", line 93, in convert
    children.append(child.convert(index = index))
  File "/data/lijiachen/HPSG-Neural-Parser/src_joint/trees.py", line 80, in convert
    assert sub_children[-1].right == sub_child.left, str(sub_children[-1].right)+'\t'+str(sub_child.left) #contiune span
AssertionError:

Look forward to your reply.

Runtime Error: The size of tensor a (1100) must match the size of tensor b (695) at non-singleton dimension 0

Hey @DoodleJZ,
I came across this error while running your parser. Could you please look into it and fix it?

Traceback (most recent call last):
  File "/content/drive/My Drive/Hd/DependencyParser/src_joint/main.py", line 746, in <module>
    main()
  File "/content/drive/My Drive/Hd/DependencyParser/src_joint/main.py", line 742, in main
    args.callback(args)
  File "/content/drive/My Drive/Hd/DependencyParser/src_joint/main.py", line 672, in run_parse
    syntree, _ = parser.parse_batch(subbatch_sentences)
  File "/content/drive/My Drive/Hd/DependencyParser/src_joint/Zparser.py", line 1364, in parse_batch
    extra_content_annotations=extra_content_annotations)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/drive/My Drive/Hd/DependencyParser/src_joint/Zparser.py", line 822, in forward
    res, current_attns = attn(res, batch_idxs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/drive/My Drive/Hd/DependencyParser/src_joint/Zparser.py", line 344, in forward
    return self.layer_norm(outputs + residual), attns_padded
RuntimeError: The size of tensor a (1100) must match the size of tensor b (695) at non-singleton dimension 0

Which version of CTB to use?

Hi,

I'm wondering which version of CTB was used in the experiments in the paper.

CTB 5.1 is listed in the paper, but this repo links to distance-parser, which in turn links to CTB 8.0.

Thanks!

UNK in produced parses

I see that the POS predictions (produced by parse.sh) are all UNKs for the constituency output. Can the model generate POS tags?

Using HPSG-Neural-Parser for parsing raw sentence

Hello, I am trying to use your work to parse a raw sentence, but I can't find any sample input files showing the expected format. Should I use CoNLL-U format for the input file, and what linguistic information do I need to provide (e.g. lemma, POS, MWT, ...)?

I have been working on this for a few days now; I have tried passing parse.sh a raw sentence on a single line, an empty dependency tree with POS tags, and CoNLL-U format, all without success.

When I pass the following raw sentence:
Which German cities have more than 250000 inhabitants?
I get the following parse tree:
(SBARQ (WHNP (UNK Which) (UNK German) (UNK cities)) (VP (UNK have) (NP (UNK more) (UNK than) (UNK 250000))) (UNK inhabitants?))

Thanks
Mohamed Eldesouki

Test set accuracy is very low

Loading model from models/joint_bert_dev=95.55_devuas=96.67_devlas=94.86.pt...
Reading dependency parsing data from data/ptb_test_3.3.0.sd
Loading test trees from data/23.auto.clean...
Loaded 2,416 test examples.
Parsing test sentences...
test-fscore (Recall=6.51, Precision=6.70, FScore=6.61) test-elapsed 0h00m34s
best test W. Punct: ucorr: 7738, lcorr: 1426, total: 56684, uas: 13.65%, las: 2.52%, ucm: 1.74%, lcm: 0.75%
best test Wo Punct: ucorr: 7150, lcorr: 1256, total: 49893, uas: 14.33%, las: 2.52%, ucm: 1.82%, lcm: 0.75%
best test Root: corr: 2416, total: 2416, acc: 100.00%

How to run the parser?

Can you provide a sample of the command line to run the parser with the pre-trained models?

Experiment with XLnet?

Firstly, I would like to say that reading your paper was fascinating.
Secondly, I would like to thank you for advancing the state of the art in both constituency parsing and dependency parsing (first place on NLP-progress).

I have not yet read the whole paper, but it seems you used BERT, and BERT is no longer state of the art: it has been surpassed by significant margins by XLNet (https://github.com/zihangdai/xlnet). I think it would be really interesting to train your neural net with XLNet instead of BERT, to see if you can push the state of the art even further!
