This project forked from hankcs/treebankpreprocessing

Python scripts preprocessing Penn Treebank and Chinese Treebank

Home Page: http://www.hankcs.com/nlp/ptb-ctb-python.html

License: GNU General Public License v3.0

TreebankPreprocessing

Python scripts for preprocessing the Penn Treebank (PTB) and the Chinese Treebank 5.1 (CTB). They can convert the treebanks to:

| Corpus | Format | Description |
| --- | --- | --- |
| constituency parse tree | .txt | one line for one sentence |
| dependency parse tree | .conllx | Basic Stanford Dependencies (SD) |
| word segmentation corpus | .tsv | first column for characters, second column for BMES tags, sentences separated by a blank line |
| part-of-speech tagging corpus | .tsv | first column for words, second column for tags, sentences separated by a blank line |
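
To make the BMES column of the word segmentation corpus concrete, here is a minimal sketch of how segmented words map to per-character tags (B for the first character of a word, M for middle characters, E for the last, S for a single-character word). The function names `words_to_bmes` and `to_tsv` are hypothetical and are not taken from the scripts in this repository.

```python
def words_to_bmes(words):
    """Return (character, tag) pairs for a list of already segmented words."""
    rows = []
    for word in words:
        if not word:
            continue                      # ignore empty strings
        if len(word) == 1:
            rows.append((word, "S"))      # single-character word
            continue
        rows.append((word[0], "B"))       # first character of the word
        for ch in word[1:-1]:
            rows.append((ch, "M"))        # characters in the middle
        rows.append((word[-1], "E"))      # last character of the word
    return rows


def to_tsv(sentences):
    """Render sentences as 'char<TAB>tag' lines, with a blank line between sentences."""
    blocks = [
        "\n".join(f"{ch}\t{tag}" for ch, tag in words_to_bmes(words))
        for words in sentences
    ]
    return "\n\n".join(blocks) + "\n"
```

For example, the words 中国 / 人 / 北京市 become the tag sequence B E S B M E.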

When designing a tagger or parser, preprocessing treebanks is a troublesome problem. We need to:

  • Split dataset into train/dev/test, following conventional splits.
  • Remove XML tags inside CTB.
  • Combine the multiline bracketed files into one file, one line for one sentence.
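
The last step, turning multiline bracketed trees into one line per sentence, can be sketched by tracking parenthesis depth. This is an illustration only, not the code used by the scripts (which rely on NLTK); the function name `one_tree_per_line` is hypothetical, and the sketch assumes that literal parentheses inside sentences are escaped (for example as -LRB- and -RRB-), so every "(" or ")" in the file is tree structure.

```python
def one_tree_per_line(text):
    """Yield each top-level bracketed tree found in `text` as a single line."""
    depth = 0
    buffer = []
    for ch in text:
        if ch == "(":
            depth += 1
        if depth > 0:
            buffer.append(ch)             # collect characters only while inside a tree
        if ch == ")" and depth > 0:
            depth -= 1
            if depth == 0:
                # Tree is complete: collapse newlines and indentation to single spaces.
                yield " ".join("".join(buffer).split())
                buffer = []
```

Text outside any bracket (blank lines, stray tokens) is skipped, so the output contains exactly one line per complete tree.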

I wondered why there were no open-source tools handling this tedious work. In the end I decided to write one myself. Hopefully it will save you some time.

Required software

  • Python3
  • NLTK
  • stanford-parser (optional), for converting to dependency parse trees

Overview

What kind of task can we perform on treebanks?

Chinese Word Segmentation

For CTB, the segmentation corpus is split as per Jiang et al. (2009):

  • CTB Training: 001–270, 400–1151. Development: 301–325. Test: 271-300.
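
As a sketch of how this split can be applied, the ranges above can be stored as inclusive (low, high) pairs and looked up by file number. The names below are hypothetical, and extracting the number from a file name such as chtb_0271.fid is an assumption about how CTB files are named, not something this README states.

```python
import re

# Inclusive file-number ranges, copied from the split listed above.
CTB_SEG_SPLIT = {
    "train": [(1, 270), (400, 1151)],
    "dev": [(301, 325)],
    "test": [(271, 300)],
}


def file_number(name):
    """Extract the first run of digits from a file name (assumed naming scheme)."""
    match = re.search(r"\d+", name)
    return int(match.group()) if match else None


def ctb_seg_subset(number, split=CTB_SEG_SPLIT):
    """Return 'train', 'dev' or 'test' for a file number, or None if it is in no range."""
    for subset, ranges in split.items():
        if any(low <= number <= high for low, high in ranges):
            return subset
    return None
```

Numbers that fall in none of the listed ranges, such as 326 to 399, return None and would simply be left out.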

Part-of-Speech Tagging

  • PTB Training: 0-18. Development: 19-21. Test: 22-24. As per Collins (2002) and Choi (2016).
  • CTB The same split as for Chinese Word Segmentation.

Phrase Structure Parsing

These scripts can also convert treebanks into the conventional data setup from Chen and Manning (2014), Dyer et al. (2015). The detailed splits are:

  • PTB Training: 02-21. Development: 22. Test: 23.
  • CTB Training: 001–815, 1001–1136. Development: 886–931, 1148–1151. Test: 816–885, 1137–1147.
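
Because the CTB parsing split interleaves several ranges, it is easy to mistype one. The sketch below encodes both parsing splits from the list above and checks that train, dev and test never overlap. The names are hypothetical, and the returned counts are numbers of section or file numbers covered by the ranges, which need not equal the number of files that actually exist in the treebank.

```python
# Inclusive ranges copied from the list above: PTB sections and CTB file numbers.
PAR_SPLITS = {
    "ptb": {"train": [(2, 21)], "dev": [(22, 22)], "test": [(23, 23)]},
    "ctb": {
        "train": [(1, 815), (1001, 1136)],
        "dev": [(886, 931), (1148, 1151)],
        "test": [(816, 885), (1137, 1147)],
    },
}


def expand(ranges):
    """Turn inclusive (low, high) pairs into a set of integers."""
    numbers = set()
    for low, high in ranges:
        numbers.update(range(low, high + 1))
    return numbers


def check_disjoint(split):
    """Raise ValueError if two subsets share a number; otherwise return subset sizes."""
    seen = set()
    sizes = {}
    for subset, ranges in split.items():
        numbers = expand(ranges)
        overlap = seen & numbers
        if overlap:
            raise ValueError(f"{subset} overlaps earlier subsets: {sorted(overlap)[:5]}")
        seen |= numbers
        sizes[subset] = len(numbers)
    return sizes
```

Calling `check_disjoint(PAR_SPLITS["ctb"])` confirms that the three CTB subsets are disjoint and reports how many file numbers each one covers.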

Dependency Parsing

You will need the Stanford Parser to convert phrase structure trees to dependency parse trees. Please download Stanford Parser version 3.3.0 and place its two jar files in this folder:

TreebankPreprocessing
├── ...
├── stanford-parser-3.3.0-models.jar
└── stanford-parser.jar
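
With the jars in place, one plausible way to drive the converter from Python is sketched below. The class names and the -basic, -conllx and -treeFile options belong to the Stanford Parser itself; the helper name `stanford_command` and the choice to put both jars on the classpath are assumptions for illustration, and tb_to_stanford.py may invoke the converter differently.

```python
import os

# Converter entry points shipped with the Stanford Parser.
CONVERTERS = {
    "en": "edu.stanford.nlp.trees.EnglishGrammaticalStructure",
    "cn": "edu.stanford.nlp.trees.international.pennchinese.ChineseGrammaticalStructure",
}


def stanford_command(tree_file, lang="en",
                     jars=("stanford-parser.jar", "stanford-parser-3.3.0-models.jar")):
    """Build (but do not run) a java command that prints CoNLL-X to standard output."""
    classpath = os.pathsep.join(jars)     # ':' on Linux and macOS, ';' on Windows
    return [
        "java", "-cp", classpath, CONVERTERS[lang],
        "-basic",                         # basic (non-collapsed) Stanford Dependencies
        "-conllx",                        # emit CoNLL-X rows
        "-treeFile", tree_file,           # one bracketed tree per line
    ]
```

The resulting list can be passed to `subprocess.run(cmd, stdout=out_file, check=True)` with an open output file to capture the .conllx output.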

OK, let's get started.

PTB

1. Import PTB into NLTK

Parsing the bracketed files relies on NLTK. Please follow the NLTK instructions and put BROWN and WSJ into nltk_data/corpora/ptb, e.g.

ptb
├── BROWN
└── WSJ

2. Run ptb.py

This script does all the work for you; it only requires a path in which to store the output.

$ python3 ptb.py --help 
usage: ptb.py [-h] --output OUTPUT [--task TASK]

Combine Penn Treebank WSJ MRG files into train/dev/test set

optional arguments:
  -h, --help       show this help message and exit
  --output OUTPUT  The folder where to store the output train/dev/test files
  --task TASK      Which task (par, pos)? Use par for phrase structure
                   parsing, pos for part-of-speech tagging

  • You will get three .txt files, one each for the train, dev and test sets.
  • If you want part-of-speech tagging corpora, simply append --task pos. This time, you will get three .tsv files.
  • The .txt files can be converted to .conllx files with tb_to_stanford.py:

$ python3 tb_to_stanford.py --help
usage: tb_to_stanford.py [-h] --input INPUT --lang LANG --output OUTPUT

Convert combined Penn Treebank files (.txt) to Stanford Dependency format
(.conllx)

optional arguments:
  -h, --help       show this help message and exit
  --input INPUT    The folder containing train.txt/dev.txt/test.txt in
                   bracketed format
  --lang LANG      Which language? Use en for English, cn for Chinese
  --output OUTPUT  The folder where to store the output
                   train.conllx/dev.conllx/test.conllx in Stanford Dependency
                   format
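
To inspect the converted files, the sketch below reads CoNLL-X text and lists the dependency arcs of each sentence. It assumes the standard ten tab-separated CoNLL-X columns and blank lines between sentences; the function names are hypothetical and nothing here is part of tb_to_stanford.py.

```python
# The ten columns of the CoNLL-X format, in order.
CONLLX_FIELDS = ("id", "form", "lemma", "cpostag", "postag",
                 "feats", "head", "deprel", "phead", "pdeprel")


def read_conllx(text):
    """Parse CoNLL-X text into a list of sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():              # a blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        token = dict(zip(CONLLX_FIELDS, line.split("\t")))
        token["id"] = int(token["id"])
        token["head"] = int(token["head"])    # 0 marks the artificial root
        current.append(token)
    if current:
        sentences.append(current)
    return sentences


def arcs(sentence):
    """Return (head id, dependent id, relation) triples for one sentence."""
    return [(tok["head"], tok["id"], tok["deprel"]) for tok in sentence]
```

Reading `open("dev.conllx", encoding="utf-8").read()` through `read_conllx` and then `arcs` gives a quick view of the converted trees.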

CTB

The CTB is a little messy: it contains extra XML tags in every gold tree and is not natively supported by NLTK. You need to specify the CTB root path (the folder containing index.html).

$ python3 ctb.py --help           
usage: ctb.py [-h] --ctb CTB --output OUTPUT [--task TASK]

Combine Chinese Treebank 5.1 fid files into train/dev/test set

optional arguments:
  -h, --help       show this help message and exit
  --ctb CTB        The root path to Chinese Treebank 5.1
  --output OUTPUT  The folder where to store the output
                   train.txt/dev.txt/test.txt
  --task TASK      Which task (seg, pos, par)? Use seg for word segmentation,
                   pos for part-of-speech tagging, par for phrase structure
                   parsing

  • Tagging and dependency parsing corpora can be obtained in the same way as for PTB.
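
For readers curious about the XML cleanup mentioned at the start of this section, here is a minimal sketch. It assumes that every tag in a CTB file sits on a line of its own (for example a sentence opener followed by the tree and a closing tag), which is an assumption about the file layout rather than something verified here; the function name `strip_xml_lines` is hypothetical and ctb.py may handle this differently.

```python
def strip_xml_lines(text):
    """Drop lines that consist solely of XML-style markup, plus blank lines."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue                      # skip blank lines
        if stripped.startswith("<") and stripped.endswith(">"):
            continue                      # skip lines that are only markup
        kept.append(line)                 # keep tree lines with their indentation
    return "\n".join(kept)
```

What remains is the bracketed tree text, which can then be combined into one line per sentence.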

Then you can start your research. Enjoy!
