amakukha / stemmers_ukrainian


A novel stemmer for the Ukrainian language trained with AI

Stemmers for Ukrainian

This repository introduces tree_stem, a new stemmer for the Ukrainian language created via machine learning. Measured by the error rate relative to truncation (ERRT; Paice 1994), it outperforms all other stemmers available to date, as well as some lemmatizers. It also has the lowest understemming index of the available stemming algorithms.

The proposed algorithm uses no dictionary lookups and stays reasonably small (48 KB of Python bytecode). It runs 24 times faster than the lemmatization approach and is also faster than the other stemming algorithms.

In addition to the new algorithm, this repository also contains Python ports of some of the previously published stemmers.

Comparison of stemmers for the Ukrainian language

| Stemmer                      | Languages         | UI     | OI       | ERRT   |
|------------------------------|-------------------|--------|----------|--------|
| Dictionary-based (reference) |                   | 0.0172 | 3.59e-06 | 0.0244 |
| tree_stem                    | Python            | 0.0907 | 2.71e-06 | 0.125  |
| pymorphy2 (Paper)            | Python            | 0.324  | 2.01e-07 | 0.391  |
| stemka                       | C++               | 0.329  | 2.34e-06 | 0.447  |
| tapkomet                     | Snowball, C, Java | 0.447  | 2.73e-06 | 0.603  |
| vgrichina                    | Groovy, Python    | 0.497  | 1.05e-06 | 0.651  |
| drupal                       | JS, Python        | 0.511  | 7.54e-07 | 0.666  |
| tochytskyi (Paper)           | PHP, Python       | 0.623  | 5.72e-07 | 0.795  |
| No stemming                  |                   | 1.00   | 1.69e-08 |        |

where:

  • UI – understemming index
  • OI – overstemming index
  • ERRT – error rate relative to truncation
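To make the first two metrics concrete, here is a minimal sketch of how UI and OI can be computed with the pair-counting definitions from Paice (1994). It is an illustration under stated assumptions, not this repository's evaluation code: the function name `stemming_indices`, the toy word groups, and the truncation stemmers are all hypothetical. ERRT is not computed here, since it additionally requires a baseline line built from truncation stemmers of several lengths.

```python
from collections import Counter, defaultdict


def stemming_indices(groups, stem):
    """Return (UI, OI) for the function `stem` over `groups`.

    `groups` is a list of lists of word forms; each inner list is one
    concept group whose forms should ideally share a single stem.
    (Hypothetical helper, written for illustration only.)
    """
    total_words = sum(len(g) for g in groups)
    gdmt = 0.0  # global desired merge total: pairs inside the same group
    gdnt = 0.0  # global desired non-merge total: pairs across groups
    gumt = 0.0  # global unachieved merge total: same-group pairs left apart
    by_stem = defaultdict(Counter)  # stem -> {group index: number of forms}

    for gi, group in enumerate(groups):
        n = len(group)
        gdmt += 0.5 * n * (n - 1)
        gdnt += 0.5 * n * (total_words - n)
        stem_counts = Counter(stem(w) for w in group)
        gumt += 0.5 * sum(u * (n - u) for u in stem_counts.values())
        for s, u in stem_counts.items():
            by_stem[s][gi] += u

    gwmt = 0.0  # global wrongly merged total: cross-group pairs sharing a stem
    for counts in by_stem.values():
        n_s = sum(counts.values())
        gwmt += 0.5 * sum(v * (n_s - v) for v in counts.values())

    ui = gumt / gdmt if gdmt else 0.0
    oi = gwmt / gdnt if gdnt else 0.0
    return ui, oi


# Toy data (an assumption for illustration): two concept groups.
groups = [["книга", "книги", "книгу"], ["кит", "кити"]]

no_stemming = stemming_indices(groups, lambda w: w)       # every form kept apart
one_letter = stemming_indices(groups, lambda w: w[:1])    # everything merged
four_letters = stemming_indices(groups, lambda w: w[:4])  # partial merging
```

On this toy data, leaving words unstemmed gives UI = 1 and OI = 0, which mirrors the UI of 1.00 in the "No stemming" row, while truncating to a single letter merges both groups and pushes OI to 1.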

Notes:

  • pymorphy2 is a dictionary-assisted lemmatizer and morphological analyzer, included in this comparison for reference. Its most probable normal form is used as the stem.
  • Training and testing were performed on a dictionary of word forms.
