This project forked from barseghyanartur/lplangid

This package is a Python implementation of the classifier described in the paper "Language Identification with a Reciprocal Rank Classifier".

License: MIT License

Shell 5.01% Python 94.99%

lplangid's Introduction

lplangid: Reciprocal Rank Classifier for Language Identification

This package is a Python implementation of the classifier described in the paper "Language Identification with a Reciprocal Rank Classifier".

For more detailed package documentation, see the project wiki.

Installation and Usage in Classification

You can install the package from PyPI (https://pypi.org/project/lplangid/) by running pip install lplangid, or by cloning this repository and running pip install -e . in this directory.

Basic usage example for language classification:

>>> from lplangid.language_classifier import RRCLanguageClassifier
>>> my_classifier = RRCLanguageClassifier.default_instance()
>>> my_classifier.get_winner("C'est use teste")
'fr'

A single 'correct' language is not always the most appropriate output. For more informative options, see RecommendedUsagePatterns.
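As one illustration of going beyond a single call, the sketch below tallies get_winner results across several messages, which can be more informative than one label for a whole conversation. It is not taken from RecommendedUsagePatterns. The helper name tally_languages and the StubClassifier class are hypothetical, included only so the example runs without the package installed; with lplangid installed, the instance returned by RRCLanguageClassifier.default_instance() could presumably be passed in instead.

```python
from collections import Counter


def tally_languages(classifier, texts):
    """Count how often each language code wins across a list of texts.

    `classifier` is any object with a get_winner(text) method, such as the
    RRCLanguageClassifier instance from the basic usage example.
    """
    return Counter(classifier.get_winner(text) for text in texts)


class StubClassifier:
    """Hypothetical stand-in so the sketch runs without lplangid installed."""

    def get_winner(self, text):
        # Toy rule for illustration only: this is not how lplangid decides.
        return "fr" if "c'est" in text.lower() else "en"


counts = tally_languages(
    StubClassifier(), ["C'est la vie", "Hello there", "C'est bon"]
)
print(counts.most_common())
```

Keeping the helper independent of the concrete classifier class makes it easy to test with a stub and to swap in the real classifier later.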

Data Preparation and Distribution

Throughout this package, languages are identified and referred to using 2-letter ISO 639-1 codes. For example: en for English, es (from Español) for Spanish, and zh (from 中文, Zhōngwén) for Chinese. These codes are used for directory names, keys in dictionary tables, and reporting classifier results.

The classifier uses the data files checked in to the ./lplangid/freq_data directory here, which total just a few megabytes. It would be relatively easy to distribute these files separately from the code. The benefit of bundling them is that the package is very easy for clients to use.

The frequency tables in ./lplangid/freq_data are built from Wikipedia data (single shards), tokenized on whitespace. In addition, a few conversational words from ./training/data_overrides.py have been added at the top of the term rank files. The xx_char_freq.csv files contain characters and sample frequencies. The xx_term_rank.csv files contain only the terms / words. Only the ranks (line numbers) of the words in these files matter. Unlike most classifier models, you can edit these files directly. For example, the word "bye" and other conversational terms that are rare in Wikipedia have already been added to the top of the en_term_rank.csv file.
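To make the "rank is the line number" idea concrete, here is a minimal sketch that reads a term rank list and scores a text by summing reciprocal ranks. It is an illustration under stated assumptions, not the package's actual scoring: the function names load_term_ranks and reciprocal_rank_score are hypothetical, the tiny word lists are made up rather than copied from freq_data, and the real classifier also draws on the character frequency files.

```python
import io


def load_term_ranks(lines):
    """Map each term to its 1-based rank, i.e. its line number in the file."""
    ranks = {}
    for line_number, line in enumerate(lines, start=1):
        term = line.strip()
        # Keep the first (best) rank if a term appears more than once.
        if term and term not in ranks:
            ranks[term] = line_number
    return ranks


def reciprocal_rank_score(text, ranks):
    """Sum 1/rank over whitespace-separated tokens found in the rank table."""
    return sum(1.0 / ranks[tok] for tok in text.lower().split() if tok in ranks)


# Made-up lists standing in for en_term_rank.csv and fr_term_rank.csv.
# "bye" sits on line 1, mimicking a conversational word added at the top.
en_ranks = load_term_ranks(io.StringIO("bye\nthe\nof\nand\n"))
fr_ranks = load_term_ranks(io.StringIO("de\nla\nle\net\n"))

text = "bye and thanks"
scores = {
    "en": reciprocal_rank_score(text, en_ranks),
    "fr": reciprocal_rank_score(text, fr_ranks),
}
print(max(scores, key=scores.get))
```

Because a term's weight depends only on its line number, moving a word to the top of a file is enough to make it count strongly for that language, which is why hand-editing these files is practical.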

See ./training/README.md for data preparation instructions and tools for adding new languages to the classifier.

Contributors

dwiddows, stephantul, erip, meirotstein
