huaxinyuan / iparser

This project forked from hankcs/iparser

Yet another dependency parser, integrated with tokenizer, tagger and visualization tool.

Home Page: http://iparser.hankcs.com/

License: GNU General Public License v3.0

Python 65.01% Perl 25.82% HTML 9.18%

iparser's Introduction

IParser: Industrial Strength Dependency Parser

Yet another multilingual dependency parser, integrated with a tokenizer, a part-of-speech tagger and a visualization tool. IParser parses a raw sentence into a dependency tree in CoNLL format and can visualize trees in your browser.

See live demo!

iparser is currently a prototype. It comes with no warranty and may not be ready for practical use.

Install

pip3 install iparser --process-dependency-links

Quick Start

CLI

Interactive Shell

You can play with IParser in an interactive mode:

$ iparser parse
I looove iparser!
1	I	_	_	PRP	_	2	nsubj	_	_
2	looove	_	_	VBP	_	0	root	_	_
3	iparser	_	_	NN	_	2	dobj	_	_
4	!	_	_	.	_	2	punct	_	_

Type a sentence and press Enter; IParser prints its dependency tree.

  • Use iparser segment or iparser tag for word segmentation or part-of-speech tagging
  • Some models may take a while to load
  • IParser is language-agnostic. Pre-trained models for English and Chinese ship with the installation package. The default model is PTB (English); switch to CTB (Chinese) by appending --language cn
  • Append --help to see the detailed manual
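If you want to post-process the parser output in your own scripts, the tab-separated lines can be read back into records. The following is a minimal sketch, assuming the ten columns follow the CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL); the names Token and read_conll are made up for this illustration and are not part of iparser.

```python
from collections import namedtuple

# Field names are an assumption based on the CoNLL-X column layout;
# Token and read_conll are hypothetical helpers, not iparser APIs.
Token = namedtuple(
    "Token", "id form lemma cpostag postag feats head deprel phead pdeprel"
)


def read_conll(text):
    """Parse one sentence of tab-separated CoNLL lines into Token tuples."""
    tokens = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank separator lines
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        tok = Token(*cols)
        # ID and HEAD are token indices, so store them as integers.
        tokens.append(tok._replace(id=int(tok.id), head=int(tok.head)))
    return tokens


sample = (
    "1\tI\t_\t_\tPRP\t_\t2\tnsubj\t_\t_\n"
    "2\tlooove\t_\t_\tVBP\t_\t0\troot\t_\t_\n"
    "3\tiparser\t_\t_\tNN\t_\t2\tdobj\t_\t_\n"
    "4\t!\t_\t_\t.\t_\t2\tpunct\t_\t_\n"
)
tokens = read_conll(sample)
# The root is the token whose HEAD is 0.
root = next(t for t in tokens if t.head == 0)
print(root.form)  # looove
```

Keeping ID and HEAD as integers makes it easy to look up a token's head with tokens[t.head - 1].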

Pipeline

$ iparser segment <<< '商品和服务'        
商品 和 服务

$ iparser tag <<< 'I looove iparser!'   
I/PRP looove/VBP iparser/NN !/.

$ iparser parse <<< 'I looove iparser!' 
1	I	_	_	PRP	_	2	nsubj	_	_
2	looove	_	_	VBP	_	0	root	_	_
3	iparser	_	_	NN	_	2	dobj	_	_
4	!	_	_	.	_	2	punct	_	_
  • iparser works with standard I/O redirection, so you can use it directly in a terminal pipeline without writing any code.
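When the tagger output is piped into another script, each word/TAG item has to be split back apart. A small sketch of that step, assuming the format is exactly the whitespace-separated word/TAG form shown above; parse_tagged is a hypothetical helper name, not an iparser function.

```python
def parse_tagged(line):
    """Split a 'word/TAG word/TAG ...' line into (word, tag) tuples."""
    pairs = []
    for item in line.split():
        # Split on the LAST slash so words that contain '/' stay intact.
        word, sep, tag = item.rpartition("/")
        if not sep:
            raise ValueError(f"missing '/' separator in {item!r}")
        pairs.append((word, tag))
    return pairs


print(parse_tagged("I/PRP looove/VBP iparser/NN !/."))
# [('I', 'PRP'), ('looove', 'VBP'), ('iparser', 'NN'), ('!', '.')]
```

The result has the same list-of-tuples shape that the tag API call returns.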

API

IParser

The all-in-one interface is provided by class IParser:

$ python3
>>> from iparser import *
>>> iparser = IParser(pos_config_file=PTB_POS, dep_config_file=PTB_DEP)
>>> print(iparser.tag('I looove iparser!'))
[('I', 'PRP'), ('looove', 'VBP'), ('iparser', 'NN'), ('!', '.')]
>>> print(iparser.parse('I looove iparser!'))
1	I	_	_	PRP	_	2	nsubj	_	_
2	looove	_	_	VBP	_	0	root	_	_
3	iparser	_	_	NN	_	2	dobj	_	_
4	!	_	_	.	_	2	punct	_	_

You can load models trained on different corpora to support other languages:

>>> iparser = IParser(seg_config_file=CTB_SEG, pos_config_file=CTB_POS, dep_config_file=CTB_DEP)
>>> print(iparser.parse('我爱依存分析!'))
1	我	_	_	PN	_	2	nsubj	_	_
2	爱	_	_	VV	_	0	root	_	_
3	依存	_	_	VV	_	2	ccomp	_	_
4	分析	_	_	VV	_	3	comod	_	_
5	!	_	_	PU	_	2	punct	_	_

If you only want to perform an intermediate step, use the following APIs.

Word Segmentation

>>> segmenter = Segmenter(CTB_SEG).load()
>>> segmenter.segment('下雨天地面积水')
['下雨天', '地面', '积水']
  • Note that you must call load to load a pre-trained model; otherwise an empty model is prepared for training.

Part-of-Speech Tagging

>>> tagger = POSTagger(PTB_POS).load()
>>> tagger.tag('I looove languages'.split())
[('I', 'PRP'), ('looove', 'VBP'), ('languages', 'NNS')]
  • POSTagger is not responsible for word segmentation. Do segmentation in advance or use IParser for convenience.
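The example above uses str.split(), which leaves punctuation attached to the preceding word ('iparser!' stays one token). If you only need a quick stand-in for English, a regular expression can separate punctuation. This is a naive sketch, not the segmentation IParser itself performs; simple_tokenize is a hypothetical helper and it will split contractions such as "don't" into three pieces.

```python
import re

# One token is either a run of word characters or a single
# non-space, non-word character (i.e. a punctuation mark).
_TOKEN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Naively split English text into words and punctuation marks."""
    return _TOKEN.findall(text)


print("I looove iparser!".split())           # ['I', 'looove', 'iparser!']
print(simple_tokenize("I looove iparser!"))  # ['I', 'looove', 'iparser', '!']
```

The resulting list can be passed to tagger.tag in place of the split() call.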

Dependency Parsing

>>> parser = DepParser(PTB_DEP).load()
>>> sentence = [('Is', 'VBZ'), ('this', 'DT'), ('the', 'DT'), ('future', 'NN'), ('of', 'IN'), ('chamber', 'NN'), ('music', 'NN'), ('?', '.')]
>>> print(parser.parse(sentence))
1	Is	_	_	VBZ	_	4	cop	_	_
2	this	_	_	DT	_	4	nsubj	_	_
3	the	_	_	DT	_	4	det	_	_
4	future	_	_	NN	_	0	root	_	_
5	of	_	_	IN	_	4	prep	_	_
6	chamber	_	_	NN	_	7	nn	_	_
7	music	_	_	NN	_	5	pobj	_	_
8	?	_	_	.	_	4	punct	_	_
  • DepParser is neither responsible for segmentation nor tagging.
  • The input must be a list of tuples of word and tag.
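To read the parser output at a glance, it can help to print each row as a relation(head, dependent) arc. A minimal sketch, assuming column 1 is the token index, column 2 the word, column 7 the head index and column 8 the relation, as in the output above; format_arcs is a hypothetical helper, not an iparser API.

```python
def format_arcs(conll_text):
    """Render each CoNLL row as 'relation(head-index, dependent-index)'."""
    rows = [line.split("\t") for line in conll_text.splitlines() if line.strip()]
    # Map token index -> word; index 0 is the artificial root node.
    forms = {int(r[0]): r[1] for r in rows}
    forms[0] = "ROOT"
    arcs = []
    for r in rows:
        idx, head, rel = int(r[0]), int(r[6]), r[7]
        arcs.append(f"{rel}({forms[head]}-{head}, {r[1]}-{idx})")
    return arcs


output = (
    "1\tIs\t_\t_\tVBZ\t_\t4\tcop\t_\t_\n"
    "2\tthis\t_\t_\tDT\t_\t4\tnsubj\t_\t_\n"
    "3\tthe\t_\t_\tDT\t_\t4\tdet\t_\t_\n"
    "4\tfuture\t_\t_\tNN\t_\t0\troot\t_\t_\n"
    "5\tof\t_\t_\tIN\t_\t4\tprep\t_\t_\n"
    "6\tchamber\t_\t_\tNN\t_\t7\tnn\t_\t_\n"
    "7\tmusic\t_\t_\tNN\t_\t5\tpobj\t_\t_\n"
    "8\t?\t_\t_\t.\t_\t4\tpunct\t_\t_\n"
)
arcs = format_arcs(output)
print(arcs[0])  # cop(future-4, Is-1)
print(arcs[3])  # root(ROOT-0, future-4)
```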

Server

$ iparser serve --help
usage: iparser serve [-h] [--port PORT]

A http server for IParser

optional arguments:
  -h, --help   show this help message and exit
  --port PORT

Train Models

IParser is designed to be language-agnostic: to support a new language, you only need to prepare corpora in that language.

Corpus Format

The format is described here: https://github.com/hankcs/TreebankPreprocessing

Configuration File

IParser uses configuration files to ensure that the same network is created before and after serialization, that is, in both the training and testing phases. This matters for research engineers who want to tune hyperparameters or train new models on corpora in other languages. Template configuration files listing all configurable parameters are provided for users to adjust.

You can check out the templates shipped with the iparsermodels package, e.g.

$ python3
>>> from iparser import *
>>> PTB_DEP
'/usr/local/python3/lib/python3.6/site-packages/iparsermodels/ptb/dep/config.ini'
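Once you have the path, you may want to list what a template contains before editing it. The sketch below uses the standard library configparser and assumes the file is plain INI syntax, which the .ini extension suggests but this README does not confirm. The section and option names in the demo file ([Network], layers) are invented for illustration; real templates will differ.

```python
import configparser
import os
import tempfile


def summarize_config(path):
    """Return {section: {option: value}} for an INI-style config file."""
    parser = configparser.ConfigParser()
    with open(path, encoding="utf-8") as f:
        parser.read_file(f)
    return {section: dict(parser[section]) for section in parser.sections()}


# Demo on a throwaway file with HYPOTHETICAL content; for a real run,
# pass a template path such as the value of PTB_DEP instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "config.ini")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("[Network]\nlayers = 2\n")
    summary = summarize_config(demo_path)

print(summary)  # {'Network': {'layers': '2'}}
```

All values come back as strings; convert them yourself where a number is expected.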

CLI

The CLI can perform training as well as prediction. It only requires a configuration file.

$ iparser segment --help
usage: iparser segment [-h] [--config CONFIG] [--action ACTION]

optional arguments:
  -h, --help       show this help message and exit
  --config CONFIG  path to config file
  --action ACTION  Which action (train, test, predict)?
  • --action train is what you are looking for.
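To launch training from a script rather than by hand, the command line can be assembled and passed to subprocess. This is a sketch based only on the help text above: the --config and --action flags are shown for iparser segment, and it is an assumption that the tag and parse subcommands accept the same flags. The helper names and the config path are hypothetical.

```python
import subprocess

# Action names taken from the --help text above.
ACTIONS = ("train", "test", "predict")


def build_command(task, config, action="train"):
    """Build the argv list for an iparser CLI call (hypothetical helper)."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action {action!r}, expected one of {ACTIONS}")
    return ["iparser", task, "--config", config, "--action", action]


def run_iparser(task, config, action="train"):
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_command(task, config, action), check=True)


# 'my_seg/config.ini' is a placeholder path for your own configuration file.
print(build_command("segment", "my_seg/config.ini"))
# ['iparser', 'segment', '--config', 'my_seg/config.ini', '--action', 'train']
```

Passing the arguments as a list avoids shell quoting problems when the config path contains spaces.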

API

The training APIs can be found in tests/train_parser.py etc.

Performance

tag

dep

The character model seems to be useless for English and Chinese, so it is disabled by default.

Acknowledgments
