PersianPoSTagger

The purpose of this project is to train a MaxEnt (Maximum Entropy) Persian Part of Speech Tagger. The project includes three parts:

1. Port:

First, we port the BijanKhan corpus into a format accepted by the OpenNLP training module. Because the corpus is too large to include in this project, you must download the Unicode corpus from the University of Tehran portal and save it as corpus/BijanKhan_original.txt. Running the Port program then creates a modified corpus: corpus/BijanKhan_opennlp.txt
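As a rough illustration of what such a port involves, here is a minimal sketch. It is not the project's Port program. It assumes the original corpus holds one token per line in the form `word<whitespace>TAG` and that a sentence ends at a `.` tagged `DELM`; both are assumptions about the input layout, and the object and method names are hypothetical. The output follows the OpenNLP training format of one sentence per line with space-separated `word_TAG` pairs.

```scala
import java.io.PrintWriter
import scala.collection.mutable.ListBuffer
import scala.io.Source

// Hypothetical helper, not the project's actual Port program.
object PortSketch {
  // Drop underscores so that a tag such as ADV_TIME becomes ADVTIME.
  def normalizeTag(tag: String): String = tag.replace("_", "")

  // Parse one "word<whitespace>TAG" line (assumed layout).
  // Blank lines and lines without exactly two fields are skipped.
  def parseLine(line: String): Option[(String, String)] =
    line.trim.split("\\s+") match {
      case Array(word, tag) if word.nonEmpty => Some((word, normalizeTag(tag)))
      case _                                 => None
    }

  // Group tokens into sentences and render each sentence as one line of
  // space-separated word_TAG pairs. A sentence is assumed to end at "." / DELM.
  def toOpenNlp(lines: Iterator[String]): List[String] = {
    val sentences = ListBuffer.empty[String]
    val current   = ListBuffer.empty[String]
    for ((word, tag) <- lines.flatMap(l => parseLine(l).toList)) {
      current += s"${word}_$tag"
      if (word == "." && tag == "DELM") {
        sentences += current.mkString(" ")
        current.clear()
      }
    }
    // Keep a trailing sentence that has no final delimiter.
    if (current.nonEmpty) sentences += current.mkString(" ")
    sentences.toList
  }

  def main(args: Array[String]): Unit = {
    val source = Source.fromFile("corpus/BijanKhan_original.txt", "UTF-8")
    val writer = new PrintWriter("corpus/BijanKhan_opennlp.txt", "UTF-8")
    try toOpenNlp(source.getLines()).foreach(line => writer.println(line))
    finally { writer.close(); source.close() }
  }
}
```

If the downloaded corpus marks sentence boundaries differently, only the end-of-sentence test inside `toOpenNlp` would need to change.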

2. Train:

Now that you have the proper training corpus, you can train a model by running Train.scala. This will create the model under model/fa-pos-maxent.bin
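For readers who want to see the shape of the training step, the following is a minimal sketch and not necessarily what Train.scala does. It assumes a reasonably recent `opennlp-tools` release (1.8 or later) on the classpath and uses OpenNLP's default training parameters; the object name `TrainSketch` is hypothetical.

```scala
import java.io.{BufferedOutputStream, File, FileOutputStream}
import java.nio.charset.StandardCharsets
import opennlp.tools.postag.{POSModel, POSSample, POSTaggerFactory, POSTaggerME, WordTagSampleStream}
import opennlp.tools.util.{MarkableFileInputStreamFactory, ObjectStream, PlainTextByLineStream, TrainingParameters}

// Hypothetical sketch; assumes opennlp-tools 1.8+ is available.
object TrainSketch {
  def train(corpus: File, modelFile: File): POSModel = {
    // Read the ported corpus line by line; each line is one sentence of word_TAG pairs.
    val lines: ObjectStream[String] =
      new PlainTextByLineStream(new MarkableFileInputStreamFactory(corpus), StandardCharsets.UTF_8)
    val samples: ObjectStream[POSSample] = new WordTagSampleStream(lines)
    try {
      // "fa" is the language code stored in the model; default parameters are an assumption.
      val model = POSTaggerME.train("fa", samples, TrainingParameters.defaultParams(), new POSTaggerFactory())
      // Make sure the output directory exists before writing the model.
      Option(modelFile.getParentFile).foreach(_.mkdirs())
      val out = new BufferedOutputStream(new FileOutputStream(modelFile))
      try model.serialize(out) finally out.close()
      model
    } finally samples.close()
  }

  def main(args: Array[String]): Unit =
    train(new File("corpus/BijanKhan_opennlp.txt"), new File("model/fa-pos-maxent.bin"))
}
```

Training on the full corpus may take a while and need a generous JVM heap; the exact requirements depend on the corpus size and the parameters used.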

3. Try it out:

Sample.scala is a small sample program that shows how to run the PoS Tagger on a piece of news text.
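A minimal sketch of what such a sample program might look like is given below. It is not the contents of Sample.scala. It assumes `opennlp-tools` is on the classpath, that the model from the Train step exists, and that the input is split on whitespace; the object name `TagSketch` and the `render` helper are hypothetical.

```scala
import java.io.FileInputStream
import opennlp.tools.postag.{POSModel, POSTaggerME}
import opennlp.tools.tokenize.WhitespaceTokenizer

// Hypothetical sketch; assumes the trained model file is present.
object TagSketch {
  // Render tokens and tags as a run of (word,TAG) pairs.
  def render(tokens: Seq[String], tags: Seq[String]): String =
    tokens.zip(tags).map { case (w, t) => s"($w,$t)" }.mkString

  def main(args: Array[String]): Unit = {
    val in    = new FileInputStream("model/fa-pos-maxent.bin")
    val model = try new POSModel(in) finally in.close()
    val tagger = new POSTaggerME(model)

    // Whitespace tokenization is an assumption; punctuation attached to a word stays attached.
    val text   = "باراک اوباما، رئیس‌جمهوری آمریکا از تغییراتی در شیوه جمع‌آوری اطلاعات و شنود تلفن‌ها توسط سازمان‌های امنیتی این کشور خبر داده است ."
    val tokens = WhitespaceTokenizer.INSTANCE.tokenize(text)
    val tags   = tagger.tag(tokens)

    println(render(tokens.toSeq, tags.toSeq))
  }
}
```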

Modified Tag Set

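The original tag set is modified so that OpenNLP picks up the complete tag names. OpenNLP's training format already uses an underscore to separate each word from its tag, so we have removed the underscores (_) from the tag names. The new tag set is as follows: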

Tag         Description
----------------------------------
ADJ         Adjective, General
ADJCMPR     Adjective, Comparative
ADJINO      Past Participle
ADJORD      Adjective, Ordinal
ADJSIM      Adjective, Simple
ADJSUP      Adjective, Superlative
ADV         Adverb, General
ADVEXM      Adverb, Exemplar
ADVI        Adverb, Question
ADVNEGG     Adverb, Negation
ADVNI       Adverb, Not Question
ADVTIME     Adverb, Time
AR          Arabic Word
CON         Conjunction
DEFAULT     Default
DELM        Delimiter
DET         Determiner
IF          Conditional
INT         Interjection
MORP        Morpheme
MQUA        Modifier of Quantifier
MS          Mathematical Symbol
NPL         Noun, Plural
NSING       Noun, Singular
NN          Number
NP          Noun Phrase
OH          Oh Interjection (حرف ندا)
OHH         Oh Noun (منادی)
P           Preposition
PP          Prepositional Phrase
PRO         Pronoun
PS          Pseudo-Sentence
QUA         Quantifier
SPEC        Specifier
VAUX        Verb, Auxiliary
VIMP        Verb, Imperative
VPA         Verb, Past Tense
VPRE        Verb, Predicative
VPRS        Verb, Present Tense
VSUB        Verb, Subjunctive

Evaluation

This is my first stab at it. No real evaluation has been performed on the model. It is possible to split the corpus into 85% and 15% portions, train on the 85% file, and evaluate on the unseen data in the 15% file. However, a quick subjective look at the sample output shows that the model makes some errors on real news data. Consider the test output:

(باراک,NSING)(اوباما،,P)(رئیس‌جمهوری,NSING)(آمریکا,NSING)(از,P)(تغییراتی,NSING)(در,P)(شیوه,NSING)(جمع‌آوری,NSING)(اطلاعات,NPL)(و,CON) (شنود,VPA)(تلفن‌ها,NPL)(توسط,P)(سازمان‌های,NSING)(امنیتی,NSING)(این,NSING)(کشور,NSING)(خبر,NSING)(داده,ADJINO)(است,VPRE)(.,DELM)

To begin with, اوباما is not a preposition. Most probably, the model tagged it as a preposition because of با or او. So that was interesting. THANKS OBAMA! Other than that, I have to refresh my Farsi grammar, but I also don't think that تغییراتی is NSING. Yet the model tagged اطلاعات, which is very similar to تغییراتی, as a plural noun. I think the ی نکره (the indefinite marker ی) is taken into account, so I really don't know whether it should be plural or singular.
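The 85%/15% evaluation mentioned above has not been done, but it could be sketched roughly as follows. This is an illustration only: the object `EvalSketch`, the seeded shuffle, and the choice of token-level accuracy as the metric are all assumptions and not part of the project. The input sentences are lines in the OpenNLP `word_TAG` format.

```scala
import scala.util.Random

// Hypothetical helpers for a held-out evaluation; not part of the project.
object EvalSketch {
  // Shuffle with a fixed seed, then cut into (train, test) portions.
  def split[A](sentences: Seq[A], trainFraction: Double = 0.85, seed: Long = 42L): (Seq[A], Seq[A]) = {
    val shuffled = new Random(seed).shuffle(sentences)
    val cut      = math.round(sentences.size * trainFraction).toInt
    shuffled.splitAt(cut)
  }

  // Extract the gold tags from one "word_TAG word_TAG ..." line.
  // The tag is taken to be everything after the last underscore of each token.
  def goldTags(line: String): Seq[String] =
    line.trim.split("\\s+").toSeq.filter(_.nonEmpty).map(t => t.substring(t.lastIndexOf('_') + 1))

  // Fraction of tokens whose predicted tag equals the gold tag.
  def accuracy(gold: Seq[String], predicted: Seq[String]): Double = {
    require(gold.size == predicted.size, "gold and predicted must have the same length")
    if (gold.isEmpty) 0.0
    else gold.zip(predicted).count { case (g, p) => g == p }.toDouble / gold.size
  }
}
```

The training portion would be written to its own file and used for training, and the tagger's output on each held-out sentence would be compared with `goldTags` through `accuracy`.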

Anyhow, this is a good start for now. Any suggestions or corrections are welcome.

Cheers! هیچ هکر
