PersianPoSTagger

The purpose of this project is to train a MaxEnt (Maximum Entropy) Persian Part of Speech Tagger.The project includes three parts:

1. Port:

First we port the BijanKhan corpus into a format that is acceptable by OpenNLP training module. Since the corpus is too large, you must download the unicode corpus from University of Tehran portal. And then save it under corpus/BijanKhan_original.txt. Once you run the Port program, a modified corpus will be created: corpus/BijanKhan_opennlp.txt

2. Train:

Now that you have the proper training corpus, you can train a model by running Train.scala. This will create the model under model/fa-pos-maxent.bin

3. Try it out:

Sample.scala is a small sample program that shows how to run the PoS Tagger on piece of a news text.

Modified Tag Set

The defined tag set is modified so that OpenNLP pick up the complete tag names. We have removed the _s. So the new tag set is as follows:

    Tag		Description
----------------------------------
ADJ		Adjective, General
ADJCMPR	        Adjective, Comparative
ADJINO		Past Participle
ADJORD		Adjective, Ordinal
ADJSIM		Adjective, Simple
ADJSUP		Adjective, Superlative
ADV		Adverb, General
ADVEXM		Adverb, Exemplar
ADVI		Adverb, Question
ADVNEGG	        Adverb, Negation
ADVNI		Adverb, Not Question
ADVTIME	        Adverb, Time
AR		Arabic Word
CON		Conjunction
DEFAULT		Default
DELM		Delimiter
DET		Determiner
IF		Conditional
INT		Interjection
MORP		Morpheme
MQUA		Modifier of Quantifier
MS		Mathematic Symbol
NPL		Noun, Plural
NSING		Noun, Singular
NN		Number
NP		Noun Phrase
OH		Oh Interjection ( حرف ندا)
OHH		Oh noun (منادی)
P		Preposition
PP		Prepositional Phrase
PRO		Pronoun 
PS		Psedo-Sentence
QUA		Quantifier
SPEC		Specifier
VAUX		Verb, Auxiliary
VIMP		Verb, Imperative
VPA		Verb, Past Tense
VPRE		Verb, Predicative
VPRS		Verb, Present Tense
VSUB		Verb, Subjunctive

Evaluation

This is my first stab at it. No real evaluation is performed on the model. It is possible to break down the model into 85% and 15% portions and train using the 85% file and evaluate with the unseen data using the 15% file. However, a quick subjective look at the sample output shows that the model has some errors w.r.t real news data. Consider the test output:

(باراک,NSING)(اوباما،,P)(رئیس‌جمهوری,NSING)(آمریکا,NSING)(از,P)(تغییراتی,NSING)(در,P)(شیوه,NSING)(جمع‌آوری,NSING)(اطلاعات,NPL)(و,CON) (شنود,VPA)(تلفن‌ها,NPL)(توسط,P)(سازمان‌های,NSING)(امنیتی,NSING)(این,NSING)(کشور,NSING)(خبر,NSING)(داده,ADJINO)(است,VPRE)(.,DELM)

To begin with, اوباما is not a proposition. Most probably, the model picked it up as a proposition because of با or او. So that was interesting. THANKS OBAMA! Other than that, I have to refresh my Farsi grammar, but I also don't think that تغییراتی is NSING. Yet the model picked اطلاعات as plural noun which is very similar to تغییراتی. I think ی نکره is taken into account - so really don't know if that should be plural or singular.

Anyhow, this is a good start for now. Any suggestion and/or corrections are welcomed.

Cheers! هیچ هکر

imclab / persianpostagger Goto Github PK