langdet's Introduction

langdet

Language detector. Returns scores identifying a human language from a sample text.

This is a very simple KNN language classifier that uses character shingles (overlapping character n-grams) to determine which language a piece of text is written in. It can be used on multilingual documents.
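To illustrate the general idea, here is a minimal sketch of shingle-based nearest-neighbour scoring. It is not langdet's actual code: the function names (`shingles`, `cosine`, `score_languages`), the use of cosine similarity, and the single profile per language are all assumptions made for this example.

```python
import math
from collections import Counter


def shingles(text, n=3):
    """Return a Counter of overlapping character n-grams (shingles)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_languages(sample, profiles, n=3):
    """Score a sample against every language profile, best match first."""
    sample_vec = shingles(sample, n)
    scores = {lang: cosine(sample_vec, prof) for lang, prof in profiles.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy profiles built from one sentence each (hypothetical training data).
profiles = {
    "en": shingles("the quick brown fox jumps over the lazy dog and the cat"),
    "es": shingles("el rápido zorro marrón salta sobre el perro perezoso y el gato"),
}
ranked = score_languages("the dog and the fox", profiles)
```

Here `ranked` is a list of `(language, score)` pairs sorted from most to least similar, which mirrors the "returns scores" behaviour described above. For a multilingual document, one option would be to score each sentence or paragraph separately.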

Usage

train-langdet.py is the model trainer. It reads in a file containing text from a known language and then outputs a JSON file containing the corresponding detection model. Various options configure the model, but --ngramSize 3 is suggested.
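As a rough sketch of what such a training step could look like, the example below counts character n-grams in a corpus and serializes the result as JSON. The JSON layout (`language`, `ngramSize`, `ngrams`), the `train_model` function, and the `top_k` cutoff are hypothetical and are not taken from train-langdet.py; only the n-gram size of 3 comes from the suggestion above.

```python
import json
from collections import Counter


def train_model(corpus_text, language, ngram_size=3, top_k=1000):
    """Build a simple n-gram frequency model for one language (hypothetical format)."""
    text = corpus_text.lower()
    counts = Counter(
        text[i:i + ngram_size] for i in range(len(text) - ngram_size + 1)
    )
    total = sum(counts.values())
    # Keep only the most frequent n-grams, stored as relative frequencies.
    return {
        "language": language,
        "ngramSize": ngram_size,
        "ngrams": {gram: count / total for gram, count in counts.most_common(top_k)},
    }


model = train_model("the quick brown fox jumps over the lazy dog", "en")

# Serialize to JSON and read it back, as a detector script would.
serialized = json.dumps(model, ensure_ascii=False, sort_keys=True)
restored = json.loads(serialized)
```

In practice the serialized string would be written to a file, one model per language, and loaded by the detection script.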

test-langdet.py is a sample script for identifying a language. It is very simple and can serve as an example for integrating language detection into another codebase.

processWikiAbstracts.py is a utility script for converting Wikipedia pages into usable text corpora for training purposes. Wikipedia abstracts can be obtained from https://dumps.wikimedia.org/enwiki/latest/, or by running download-abstracts.sh <lang>.
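For orientation, the sketch below shows one way to pull plain text out of an abstracts dump. It assumes the dump is XML with a `<feed>` root containing `<doc>` elements that each hold an `<abstract>` child; that layout, the `iter_abstracts` name, and the inline sample data are assumptions for this example and do not describe how processWikiAbstracts.py works.

```python
import io
import xml.etree.ElementTree as ET


def iter_abstracts(xml_stream):
    """Yield non-empty abstract strings from an abstracts XML stream."""
    # iterparse streams the file, so large dumps need not fit in memory.
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag == "doc":
            abstract = (elem.findtext("abstract") or "").strip()
            if abstract:
                yield abstract
            elem.clear()  # Release the children of the processed <doc>.


# Tiny in-memory stand-in for a real dump file (hypothetical data).
sample_xml = """<feed>
  <doc>
    <title>Wikipedia: Example</title>
    <abstract>An example is a representative instance.</abstract>
  </doc>
  <doc>
    <title>Wikipedia: Empty</title>
    <abstract></abstract>
  </doc>
  <doc>
    <title>Wikipedia: Language</title>
    <abstract>Language is a structured system of communication.</abstract>
  </doc>
</feed>"""

abstracts = list(iter_abstracts(io.BytesIO(sample_xml.encode("utf-8"))))
corpus = "\n".join(abstracts)
```

The resulting `corpus` string is the kind of single-language text a trainer could consume. A real dump would be opened with `open(path, "rb")` (or `gzip.open` for a compressed file) instead of `io.BytesIO`.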

jonathankoren / langdet