NLTK Slovenian POS tagger

This project uses the IJS JOS-1M corpus to train a part-of-speech tagger for the Slovenian language.

Quick usage

The POS tagger is available on PyPI with a prebuilt dictionary. Installation:

pip install slopos

Usage:

import slopos

slopos.tag("Jaz sem iz okolice Ljubljane")

> [('Jaz', 'ZOP-EI'),
 ('sem', 'GP-SPE-N'),
 ('iz', 'DR'),
 ('okolice', 'SOZER'),
 ('Ljubljane', 'SLZER.')]

The tag reference is contained in tag_reference-sl.txt (Slovenian) and tag_reference-en.txt (English).
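When working with the tagger's output, it can help to group tokens by coarse word class. The sketch below assumes that the first character of each tag identifies the word class, which the sample output suggests but which should be checked against the tag reference files. `group_by_first_letter` is a hypothetical helper, not part of slopos.

```python
from collections import defaultdict

# Sample output copied from the usage example above.
tagged = [
    ("Jaz", "ZOP-EI"),
    ("sem", "GP-SPE-N"),
    ("iz", "DR"),
    ("okolice", "SOZER"),
    ("Ljubljane", "SLZER."),
]

def group_by_first_letter(tagged_tokens):
    """Group words by the first character of their tag (assumed word class)."""
    groups = defaultdict(list)
    for word, tag in tagged_tokens:
        # A tagger may return None for unknown words; file those under "".
        key = tag[0] if tag else ""
        groups[key].append(word)
    return dict(groups)

groups = group_by_first_letter(tagged)
print(groups)
```

Grouping on a longer tag prefix would give finer categories, if the tag reference supports that reading.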

Prepared files

The corpus was processed in several steps to prepare it for NLTK consumption. The partially processed files are part of this repository.

  1. Original corpus

    The original corpus is stored in multiple split XML files, which are kept in the xml directory.

  2. Partial text files

    The XML files have been processed and converted into an NLTK-readable word/tag format using the convert_xml_to_txt.py script. The processed files are stored in the txt directory.

  3. NLTK tagged corpus

    Files from the txt directory have been combined into a single file and stored in the data/tagged_corpus directory for nltk-trainer consumption.
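The conversion and combining steps can be sketched roughly as follows. This is not the actual convert_xml_to_txt.py: the element and attribute names (`s`, `w`, `msd`) and the miniature sample document are assumptions made for illustration, and the real corpus schema may differ (for example, it may use XML namespaces).

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature corpus: <s> = sentence, <w> = word, msd = tag.
SAMPLE_XML = """
<text>
  <s><w msd="DR">iz</w><w msd="SOZER">okolice</w></s>
  <s><w msd="ZOP-EI">Jaz</w><w msd="GP-SPE-N">sem</w></s>
</text>
"""

def xml_to_tagged_lines(xml_text):
    """Return one 'word/tag word/tag ...' line per sentence."""
    root = ET.fromstring(xml_text.strip())
    lines = []
    for sentence in root.iter("s"):
        pairs = [f"{w.text}/{w.get('msd')}" for w in sentence.iter("w")]
        lines.append(" ".join(pairs))
    return lines

def combine(chunks):
    """Join the lines of several converted files into one corpus text."""
    return "\n".join(line for chunk in chunks for line in chunk) + "\n"

lines = xml_to_tagged_lines(SAMPLE_XML)
corpus_text = combine([lines])
print(corpus_text, end="")
```

Each sentence ends up on its own line with tokens separated by spaces, which is the layout a word/tag corpus reader would typically expect.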

Training the POS tagger

The POS tagger is trained using the nltk-trainer project, which is included as a submodule in this project.

Install dependencies

virtualenv .
pip install -r requirements
pip install numpy
python nltk-trainer/setup.py install

Convert input files

python convert_xml_to_txt.py

Train

In the top-level project directory, run the trainer:

python nltk-trainer/train_tagger.py data/tagged_corpus --reader nltk.corpus.reader.tagged.TaggedCorpusReader --filename slopos/sl-tagger.pickle

Training takes a short while, and you should see output of the form:

loading data/tagged_corpus
15758 tagged sents, training on 15758
training AffixTagger with affix -3 and backoff <DefaultTagger: tag=-None->
training <class 'nltk.tag.sequential.UnigramTagger'> tagger with backoff <AffixTagger: size=11492>
training <class 'nltk.tag.sequential.BigramTagger'> tagger with backoff <UnigramTagger: size=109127>
training <class 'nltk.tag.sequential.TrigramTagger'> tagger with backoff <BigramTagger: size=130795>
evaluating TrigramTagger
accuracy: 0.930942
creating directory out
dumping TrigramTagger to out/sl-tagger.pickle

The trained tagger is written to the path given by --filename; in the sample output above that is out/sl-tagger.pickle.
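The log shows a chain of taggers in which each one falls back to the previous one when it has no answer: trigram, then bigram, then unigram, then a three-character suffix tagger. The idea can be illustrated with a toy, pure-Python sketch of the last two links. This is not NLTK's implementation; the class names are hypothetical and the training data is just the sample sentence from this README.

```python
from collections import Counter, defaultdict

class SuffixTagger:
    """Tags a word by the most frequent tag seen for its last 3 characters."""
    def __init__(self, train_sents, length=3):
        self.length = length
        counts = defaultdict(Counter)
        for sent in train_sents:
            for word, tag in sent:
                counts[word[-length:]][tag] += 1
        self.table = {k: c.most_common(1)[0][0] for k, c in counts.items()}

    def tag_word(self, word):
        return self.table.get(word[-self.length:])  # None if suffix unseen

class UnigramBackoffTagger:
    """Tags a word by its most frequent tag, else asks the backoff tagger."""
    def __init__(self, train_sents, backoff=None):
        counts = defaultdict(Counter)
        for sent in train_sents:
            for word, tag in sent:
                counts[word][tag] += 1
        self.table = {w: c.most_common(1)[0][0] for w, c in counts.items()}
        self.backoff = backoff

    def tag_word(self, word):
        if word in self.table:
            return self.table[word]
        return self.backoff.tag_word(word) if self.backoff else None

    def tag(self, words):
        return [(w, self.tag_word(w)) for w in words]

train = [[("Jaz", "ZOP-EI"), ("sem", "GP-SPE-N"), ("iz", "DR"),
          ("okolice", "SOZER"), ("Ljubljane", "SLZER.")]]
tagger = UnigramBackoffTagger(train, backoff=SuffixTagger(train))

# "okolice" is known; "police" is unseen but shares the suffix "ice";
# "xyz" matches nothing, so the chain returns None.
result = tagger.tag(["okolice", "police", "xyz"])
print(result)
```

Bigram and trigram taggers extend the same pattern by looking up the word together with the preceding one or two tags before deferring to the backoff.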

Using the POS tagger

After training, the POS tagger is stored as a Python pickle file; you will need NLTK installed to load it.

Usage:

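import pickle

# Load the trained tagger; NLTK must be importable for unpickling to work.
with open('out/sl-tagger.pickle', 'rb') as f:
    sl_tagger = pickle.load(f)

# The tagger expects a list of words, not a raw string.
sl_tagger.tag(["Jaz", "sem", "iz", "okolice", "Ljubljane"])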

> [('Jaz', 'ZOP-EI'),
 ('sem', 'GP-SPE-N'),
 ('iz', 'DR'),
 ('okolice', 'SOZER'),
 ('Ljubljane', 'SLZER.')]

Note that punctuation should be stripped from words for proper detection. The tag reference is contained in tag_reference-sl.txt (Slovenian) and tag_reference-en.txt (English).
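Since the pickled tagger expects a list of words without attached punctuation, a small preprocessing step is needed. A minimal sketch, assuming whitespace tokenisation and ASCII punctuation are sufficient for your input; `to_words` is a hypothetical helper, not part of this project:

```python
import string

def to_words(sentence):
    """Split on whitespace and strip leading/trailing ASCII punctuation."""
    words = (token.strip(string.punctuation) for token in sentence.split())
    return [w for w in words if w]  # drop tokens that were only punctuation

words = to_words("Jaz sem iz okolice Ljubljane.")
print(words)
```

The resulting list can be passed directly to sl_tagger.tag(words).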

Contributors

izacus
