kiranramnath007 / languagedetector

This is a multi-class random forest based language detector that works by using words, bigrams and trigrams as features. Support is provided for 7 languages using the Latin Script (English, German, Portuguese, Spanish, French, Dutch, Italian).

languageDetector

This is a multi-class random forest based language detector that uses words, bigrams and trigrams as features. It supports 7 languages written in the Latin script (English, German, Portuguese, Spanish, French, Dutch, Italian).

To keep training and testing fast by reducing the number of features, a corpus of 300,000 sentences per language is taken from the Leipzig Corpora Collection, and the 50 most frequent words, bigrams, and trigrams are shortlisted as features. The dataframe creation is slightly complicated, but it is highly vectorized to speed up performance. All train and test datapoints are then represented in the reduced feature space. A model trained on 5,000 sentences from each language takes less than 2 minutes to train and reaches 98% accuracy.

To replicate the environment, place the following data files, sourced from this Google Drive directory, in one directory and assign its path to 'dirname':

  • deu_mixed-typical_2011_300K-sentences.txt
  • eng_news_2005_300K-sentences.txt
  • fra_mixed_2009_300K-sentences.txt
  • ita_mixed-typical_2017_300K-sentences.txt
  • nld_mixed_2012_300K-sentences.txt
  • por_newscrawl_2011_300K-sentences.txt
  • spa_news_2006_300K-sentences.txt
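
The shortlisting step described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper names `char_ngrams` and `top_features`, the lowercasing, and the use of character-level bigrams and trigrams (suggested by the ' a ' example given under "Scope for improvement") are all assumptions.

```python
from collections import Counter

def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def top_features(sentences, k=50):
    """Return the k most frequent words, bigrams and trigrams (hypothetical helper)."""
    words, bigrams, trigrams = Counter(), Counter(), Counter()
    for sentence in sentences:
        s = sentence.lower()
        words.update(s.split())
        bigrams.update(char_ngrams(s, 2))
        trigrams.update(char_ngrams(s, 3))
    return {
        "words": [w for w, _ in words.most_common(k)],
        "bigrams": [b for b, _ in bigrams.most_common(k)],
        "trigrams": [t for t, _ in trigrams.most_common(k)],
    }

# Toy corpus standing in for one of the 300K-sentence files.
sample = ["the cat sat on the mat", "the dog ate the bone"]
features = top_features(sample, k=3)
print(features["words"][0])  # most frequent word in the toy corpus
```

With the real data, `top_features` would be called once per language on the sentences read from the corresponding file under 'dirname'.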

Performance

The longest task is finding the most common features for every language (~1 minute per language). Creating the training dataframe then takes ~2 minutes, and creating the random forest model takes ~1 minute. The following performance metrics are calculated:

  • $\text{Precision} = \dfrac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$
  • $\text{Recall} = \dfrac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$
  • $\text{F1 Score} = \dfrac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$

Overall, the model achieves 98% precision, 98% recall and a 98% F1 score.
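
As a worked illustration of the three formulas, the sketch below computes precision, recall and F1 for one class from lists of true and predicted labels. The labels are made up for the example and the helper name `per_class_metrics` is hypothetical; the notebook itself may rely on a library routine instead.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall and F1 for one class, treated one-vs-rest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Made-up labels: one French sentence is misclassified as Portuguese.
y_true = ["fra", "fra", "por", "por", "eng", "eng"]
y_pred = ["fra", "por", "por", "por", "eng", "eng"]
precision, recall, f1 = per_class_metrics(y_true, y_pred, "por")
print(f"{precision:.3f} {recall:.3f} {f1:.3f}")  # 0.667 1.000 0.800
```

Here Portuguese has 2 true positives, 1 false positive and no false negatives, so precision is 2/3, recall is 1 and F1 is 0.8.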

Confusion matrix

The performance is also examined using the confusion matrix, which shows the distribution of predicted labels versus actual labels.
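
A confusion matrix of this kind can be built with a few lines of standard-library Python. This is only a sketch on made-up labels; the row/column convention (rows are actual labels, columns are predicted labels) and the function name are assumptions for the example.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows are actual labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(actual, predicted)] for predicted in labels]
            for actual in labels]

labels = ["eng", "fra", "por"]
y_true = ["fra", "fra", "por", "por", "eng", "eng"]
y_pred = ["fra", "por", "por", "por", "eng", "eng"]
matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(label, row)
```

Off-diagonal entries show which pairs of languages get confused; in this toy data the only such entry is in the "fra" row under the "por" column.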

Novel ideas

Using all bigrams, trigrams, and words would blow up the feature space and hurt performance. The features are therefore first shortlisted by frequency, keeping only the most common ones. This keeps the model accurate while limiting the time taken.
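
To make the size bound concrete: if the shortlist holds 50 features of each of the three kinds per language (one reading of the description), 7 languages give at most 7 × 3 × 50 = 1,050 columns, and fewer once features shared between languages are merged. A rough sketch of mapping a sentence onto such a fixed feature list follows. The tiny feature list, the counting scheme, and the function name `vectorize` are assumptions, not the notebook's implementation.

```python
def vectorize(sentence, features):
    """Count occurrences of each shortlisted feature in `sentence`.

    Word features are matched against whitespace tokens; every other
    feature is treated as a character n-gram and counted with overlap.
    """
    s = sentence.lower()
    tokens = s.split()
    vector = []
    for kind, value in features:
        if kind == "word":
            vector.append(tokens.count(value))
        else:
            n = len(value)
            vector.append(sum(s[i:i + n] == value
                              for i in range(len(s) - n + 1)))
    return vector

# Hypothetical shortlist; the real one would come from the corpus counts.
features = [("word", "the"), ("word", "der"),
            ("bigram", "th"), ("trigram", " de")]
print(vectorize("The cat and the dog", features))    # [2, 0, 2, 0]
print(vectorize("Der Hund und der Ball", features))  # [0, 2, 0, 1]
```

Every sentence maps to a vector of the same length regardless of its vocabulary, which is what keeps the training dataframe small.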

Scope for improvement

The feature space needs further pruning to remove redundancies. One approach could be the use of maximal substrings. For example, the bigram ' a' is always contained in the trigram ' a ', so the two features are partly redundant and one of them could be removed. The size of the training data used for feature shortlisting could also be reduced, and an optimal size explored. Furthermore, the accuracy of the model for French and Portuguese could be improved by using slightly more features for these two languages in particular.
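
One possible reading of the maximal-substring idea: if a shorter n-gram occurs exactly as often as a longer shortlisted n-gram that contains it, the shorter one never appears on its own and adds no information. The sketch below applies that rule to a dictionary of n-gram counts. The rule, the counts, and the function name `prune_non_maximal` are illustrative assumptions, not something the project implements.

```python
def prune_non_maximal(counts):
    """Drop n-grams that only ever occur inside a longer n-gram.

    `counts` maps each n-gram to its corpus frequency. A shorter n-gram
    is removed when a longer one contains it and has the same count.
    """
    kept = {}
    for gram, count in counts.items():
        redundant = any(
            len(other) > len(gram) and gram in other and other_count == count
            for other, other_count in counts.items()
        )
        if not redundant:
            kept[gram] = count
    return kept

# Made-up counts: "qu" always appears as part of "que", whereas " a"
# also appears outside " a " and is therefore kept.
counts = {"qu": 120, "que": 120, " a": 300, " a ": 90}
print(sorted(prune_non_maximal(counts)))  # [' a', ' a ', 'que']
```

The pairwise comparison is quadratic in the number of features, which is acceptable for a shortlist of around a thousand entries.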
