wooorm / franc

Natural language detection

Home Page: https://wooorm.com/franc/

License: MIT License


franc's Introduction

franc


Detect the language of text.

What’s so cool about franc?

  1. franc can support more languages(†) than any other library
  2. franc is packaged with support for 82, 187, or 414 languages
  3. franc has a CLI

† - Based on the UDHR, the most translated copyright-free document in the world.

What’s not so cool about franc?

franc supports many languages, which means it’s easily confused on small samples. Make sure to pass it big documents to get reliable results.

Install

👉 Note: this installs the franc package, with support for 187 languages (those with 1 million or more speakers). franc-min (82 languages, 8M or more speakers) and franc-all (all 414 supported languages) are also available. Finally, use franc-cli to install the CLI.

This package is ESM only. In Node.js (version 14.14+, 16.0+), install with npm:

npm install franc

In Deno with esm.sh:

import {franc, francAll} from 'https://esm.sh/franc@6'

In browsers with esm.sh:

<script type="module">
  import {franc, francAll} from 'https://esm.sh/franc@6?bundle'
</script>

Use

import {franc, francAll} from 'franc'

franc('Alle menslike wesens word vry') //=> 'afr'
franc('এটি একটি ভাষা একক IBM স্ক্রিপ্ট') //=> 'ben'
franc('Alle menneske er fødde til fridom') //=> 'nno'

franc('') //=> 'und' (language code that stands for undetermined)

// You can change what’s too short (default: 10):
franc('the') //=> 'und'
franc('the', {minLength: 3}) //=> 'sco'

console.log(francAll('Considerando ser essencial que os direitos humanos'))
//=> [['por', 1], ['glg', 0.771284519307895], ['spa', 0.6034146900423971], …123 more items]

console.log(francAll('Considerando ser essencial que os direitos humanos', {only: ['por', 'spa']}))
//=> [['por', 1 ], ['spa', 0.6034146900423971]]

console.log(francAll('Considerando ser essencial que os direitos humanos', {ignore: ['spa', 'glg']}))
//=> [['por', 1], ['cat', 0.5367251059928957], ['src', 0.47461899851037015], …121 more items]

API

This package exports the identifiers franc, francAll. There is no default export.

franc(value[, options])

Get the most probable language for the given value.

Parameters
  • value (string) — value to test
  • options (Options, optional) — configuration
Returns

The most probable language (string).

francAll(value[, options])

Get the most probable languages for the given value.

Parameters
  • value (string) — value to test
  • options (Options, optional) — configuration
Returns

Array of [language, distance] tuples (Array<[string, number]>), most probable first.

Options

Configuration (Object, optional) with the following fields:

options.only

Languages to allow (Array<string>, optional).

options.ignore

Languages to ignore (Array<string>, optional).

options.minLength

Minimum length to accept (number, default: 10).

CLI

Install:

npm install franc-cli --global

Use:

CLI to detect the language of text

Usage: franc [options] <string>

Options:

  -h, --help                    output usage information
  -v, --version                 output version number
  -m, --min-length <number>     minimum length to accept
  -o, --only <string>           allow languages
  -i, --ignore <string>         disallow languages
  -a, --all                     display all guesses

Usage:

# output language
$ franc "Alle menslike wesens word vry"
# afr

# output language from stdin (expects utf8)
$ echo "এটি একটি ভাষা একক IBM স্ক্রিপ্ট" | franc
# ben

# ignore certain languages
$ franc --ignore por,glg "O Brasil caiu 26 posições"
# src

# output language from stdin with only
$ echo "Alle mennesker er født frie og" | franc --only nob,dan
# nob

Data

Supported languages
Package     Languages   Speakers
franc-min   82          8M or more
franc       187         1M or more
franc-all   414         -
Language code

👉 Note: franc returns ISO 639-3 codes (three letter codes). Not ISO 639-1 or ISO 639-2. See also GH-10 and GH-30.

To get more info about the languages represented by ISO 639-3, use iso-639-3. There is also an index available to map ISO 639-3 to ISO 639-1 codes, iso-639-3/to-1.json, but note that not all 639-3 codes can be represented in 639-1.
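A minimal sketch of that mapping. The complete index ships as iso-639-3/to-1.json; the object below is only a hand-picked excerpt for illustration:

```javascript
// Illustrative excerpt of the ISO 639-3 -> ISO 639-1 index
// (the full table lives in `iso-639-3/to-1.json`).
const to1 = {afr: 'af', ben: 'bn', eng: 'en', nno: 'nn', por: 'pt', spa: 'es'}

function toIso6391(code) {
  // Some 639-3 codes (e.g. `sco`) have no 639-1 equivalent;
  // fall back to the three-letter code in that case.
  return to1[code] || code
}

console.log(toIso6391('por')) // 'pt'
console.log(toIso6391('sco')) // 'sco'
```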

Types

These packages are fully typed with TypeScript. They export the additional types TrigramTuple and Options.

Compatibility

These packages are at least compatible with all maintained versions of Node.js. As of now, that is Node.js 14.14+ and 16.0+. They also work in Deno and modern browsers.

Ports

Franc has been ported to several other programming languages.

The works franc is derived from have themselves also been ported to other languages.

Derivation

Franc is a derivative work from guess-language (Python, LGPL), guesslanguage (C++, LGPL), and Language::Guess (Perl, GPL). Their creators granted me the rights to distribute franc under the MIT license: respectively, Kent S. Johnson, Jacob R. Rideout, and Maciej Ceglowski.

Contribute

Yes please! See How to Contribute to Open Source.

Security

This package is safe.

License

MIT © Titus Wormer

franc's People

Contributors

cyb3rk0tik, dsblv, jeffhuys, kabishev, kamilbielawski, lorumic, pandrewhk, timdiggins, verden11, wooorm


franc's Issues

Question about language accuracy

Is it normal that a sentence such as "show me my services" gets classified as Spanish before English?

> franc.all('show me my services', {minLength:1, whitelist:['eng','spa']})
[ [ 'spa', 1 ], [ 'eng', 0.9074778200253486 ] ]

It looks weird to me, since tokens like "sh" and "my", and some letters ("w" or "y" in a sentence), are really uncommon in Spanish.

Problems with latin alphabet languages

A term like "yellow flicker beat" suggests German, with English (the correct answer) quite far below.

Can you explain how this would work?

I would like to use franc in combination with a spell checker, first detecting the language and then looking up correct words with a spell checker using the identified language.

I got NaN when running franc.all

I run the following code:

franc.all('פאר טסי', {minLength: 3})
// result: [ [ 'heb', NaN ], [ 'ydd', NaN ] ]

Why did I get NaN? Is there a quick fix?

npm versions of franc-all and franc-most?

I'd like to use franc-most or franc-all in a project, but if I can't use npm install and my package.json, it makes the project more difficult to work with and more difficult to collaborate on. I wasn't able to find them on npmjs.org. Are these available under a different name?

How to improve single word detection with limited list of supported languages

Hello again.
I currently have this:

var q = 'отличный';
var guessedLanguageCode = franc(q, {
    whitelist: ['eng', 'rus', 'spa']
}); // <- returns `und`

In this particular case, q contains letters that are clearly not part of either the English or the Spanish alphabet. Nevertheless, franc returns und. Is there any way to improve detection of single words when there is just a handful of languages we need to support?
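One possible workaround (an assumption on my part, not a franc feature): for very short input, a Unicode script check can rule out whole groups of candidate languages before franc is even called. A sketch:

```javascript
// Hypothetical pre-filter (not part of franc): Cyrillic input can
// never be English or Spanish, so a script check can short-circuit
// detection for single words.
function looksCyrillic(value) {
  return /[\u0400-\u04FF]/.test(value)
}

console.log(looksCyrillic('отличный'))  // true  -> must be rus here
console.log(looksCyrillic('excellent')) // false -> eng or spa
```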

152 languages with npm, not 175

I may be counting wrong, but in data.json, there appear to be around 152 languages, not the 175 languages the README describes. Was that number an approximation?

Getting weird results

Hey @wooorm am I doing something wrong here?

> apps.forEach(app => console.log(franc(app.description), app.description))

eng A universal clipboard managing app that makes it easy to access your clipboard from anywhere on any device
fra 5EPlay CSGO Client
nob Open-source Markdown editor built for desktop
eng Communication tool to optimize the connection between people
vmw Wireless HDMI
eng An RSS and Atom feed aggregator
eng A work collaboration product that brings conversation to your files.
src Pristine Twitter app
dan A Simple Friendly Markdown Note.
nno An open source trading platform
eng A hackable text editor for the 21 st Century
eng One workspace open to all designers and developers
nya A place to work + a way to work
cat An experimental P2P browser
sco Focused team communications
sco Bitbloq is a tool to help children to learn and create programs for a microcontroller or robot, and to load them easily.
eng A simple File Encryption application for Windows. Encrypt your bits.
eng Markdown editor witch clarity +1
eng Text editor with the power or Markdown
eng Open-sourced note app for programmers
sco Web browser that automatically blocks ads and trackers
bug Facebook Messenger app
dan Markdown editor for Mac / Windows / Linux
fra Desktop build status notifications
sco Group chat for global teams
src Your rubik's cube solves
sco Orthodox web file manager with console and editor
cat Game development tools
sco RPG style coding application
deu Modern browser without tabs
eng Your personal galaxy of inspiration
sco A menubar/taskbar Gmail App for Windows, macOS and Linux.

How to contribute languages

Hi! I'm curious to know how to contribute new languages to franc? Do you have a standard method for creating the definition lines?

Possible on streaming data?

Hi,
first of all thanks for sharing such a nice tool.
I'd like to know whether it is possible to use this to get accurate data from the Twitter streaming API using Node.js.

The app I'm creating lets people get trusted data about politics. Currently I get a few accurate results from retweets only, but I would like to get more accurate data from raw streaming tweets.

I'm streaming data by tracking keywords. For example, "track: donald trump" returns realtime tweets about Trump, but it also returns unwanted funny quotes, memes, etc. I just want the tweets that are relevant to the topic.

many thanks for any help and thanks again for this tool

Issue in detecting English

Hi, I found that language detection for basic English sentences is poor.
ex: var lan = franc.all( "I am not good at detecting languages." )

result: [ [ "dan", 1 ], [ "pam", 0.9966273187183811 ], [ "cat", 0.9858347386172007 ], [ "tpi", 0.9021922428330522 ], [ "nob", 0.8954468802698146 ], [ "tgl", 0.8671163575042158 ], [ "swe", 0.8526138279932547 ], [ "nno", 0.8094435075885329 ], [ "eng", 0.8084317032040472 ], [ "ind", 0.7925801011804384 ], [ "afr", 0.7895446880269814 ], [ "bcl", 0.7736930860033727 ], [ "jav", 0.7602023608768971 ], [ "ace", 0.742327150084317 ], [ "hil", 0.736593591905565 ], [ "ceb", 0.736256323777403 ], [ "lav", 0.7251264755480606 ], [ "hms", 0.7234401349072512 ], [ "tzm", 0.7234401349072512 ], [ "bug", 0.6934232715008432 ], [ "sco", 0.6664418212478921 ], [ "fra", 0.6657672849915683 ], [ "ban", 0.6620573355817876 ], [ "min", 0.6590219224283305 ], [ "deu", 0.6586846543001686 ], [ "ssw", 0.6344013490725127 ], [ "nld", 0.6259696458684654 ], [ "sun", 0.6236087689713322 ], [ "mos", 0.6145025295109612 ], [ "aka", 0.6040472175379427 ], [ "wol", 0.5854974704890388 ], [ "ilo", 0.5517706576728499 ], [ "war", 0.5450252951096122 ], [ "bem", 0.5386172006745362 ], [ "glg", 0.5365935919055649 ], [ "tiv", 0.5342327150084317 ], [ "src", 0.5338954468802698 ], [ "mad", 0.5258010118043845 ], [ "ckb", 0.5204047217537943 ], [ "nso", 0.5166947723440135 ], [ "run", 0.512310286677909 ], [ "uzn", 0.5119730185497471 ], [ "toi", 0.5089376053962901 ], [ "bci", 0.500168634064081 ], [ "nds", 0.49409780775716694 ], [ "tsn", 0.478920741989882 ], [ "als", 0.47858347386172007 ], [ "por", 0.47386172006745364 ], [ "tso", 0.47082630691399663 ], [ "spa", 0.4674536256323777 ], [ "sot", 0.466441821247892 ], [ "bam", 0.45834738617200677 ], [ "nya", 0.457672849915683 ], [ "lit", 0.45059021922428333 ], [ "rmn", 0.4499156829679595 ], [ "ndo", 0.44957841483979766 ], [ "tuk", 0.4458684654300169 ], [ "nyn", 0.4441821247892074 ], [ "snk", 0.44215851602023604 ], [ "kin", 0.4411467116357505 ], [ "uig", 0.4404721753794266 ], [ "ron", 0.4300168634064081 ], [ "zul", 0.4269814502529511 ], [ "emk", 0.42495784148397975 ], [ "lun", 
0.42495784148397975 ], [ "nhn", 0.4215851602023609 ], [ "rmy", 0.41787521079258005 ], [ "hat", 0.41483979763912315 ], [ "ita", 0.41483979763912315 ], [ "ewe", 0.41180438448566614 ], [ "xho", 0.4101180438448566 ], [ "yao", 0.40775716694772346 ], [ "sna", 0.40067453625632377 ], [ "umb", 0.39932546374367617 ], [ "knc", 0.3942664418212479 ], [ "cjk", 0.3942664418212479 ], [ "kng", 0.39291736930860033 ], [ "hun", 0.3709949409780776 ], [ "plt", 0.37032040472175376 ], [ "kde", 0.36998313659359194 ], [ "som", 0.3595278246205733 ], [ "suk", 0.3591905564924115 ], [ "quy", 0.35750421585160197 ], [ "tur", 0.3534569983136594 ], [ "snn", 0.35143338954468806 ], [ "swh", 0.35109612141652613 ], [ "epo", 0.3504215851602024 ], [ "lug", 0.34974704890387853 ], [ "quz", 0.3490725126475548 ], [ "gaa", 0.34839797639123105 ], [ "men", 0.3463743676222597 ], [ "kmb", 0.34569983136593596 ], [ "ces", 0.3365935919055649 ], [ "dip", 0.33524451939291733 ], [ "est", 0.3349072512647555 ], [ "ayr", 0.33423271500843166 ], [ "hau", 0.3247892074198988 ], [ "dyu", 0.3163575042158516 ], [ "lin", 0.31365935919055654 ], [ "bin", 0.30826306913996626 ], [ "gax", 0.3032040472175379 ], [ "sag", 0.2930860033726813 ], [ "srp", 0.29072512647554805 ], [ "lua", 0.2897133220910624 ], [ "vmw", 0.28364249578414835 ], [ "vie", 0.2789207419898819 ], [ "ibb", 0.23440134907251264 ], [ "azj", 0.2249578414839798 ], [ "pol", 0.2236087689713322 ], [ "bos", 0.2165261382799325 ], [ "slk", 0.20674536256323772 ], [ "hrv", 0.2020236087689713 ], [ "qug", 0.19999999999999996 ], [ "tem", 0.19999999999999996 ], [ "ada", 0.18549747048903875 ], [ "slv", 0.18111298482293425 ], [ "fin", 0.1615514333895447 ], [ "kbp", 0.15210792580101185 ], [ "ibo", 0.13929173693086006 ], [ "yor", 0.127150084317032 ], [ "fon", 0.1183811129848229 ] ]

Numbers of speakers outdated

Currently Hindi is in second place by native speakers, not Spanish; English is third and Spanish fourth.

Of course, you could also change the numbers to show the total speakers of each language (not just the native ones).

Explain the output of 'all'

The results of 'all' consist of a language code and a score. I've guessed that the lowest number is the detected language, but what can be learned from the score? It doesn't seem to be documented.

I'm looking to detect the language of job titles in English and French only (because Canada), and I was getting results all over the place using just franc(jobTitle). But by whitelisting English and French and then applying a threshold to the score, I was able to tune in a much more accurate result (still a 3.92% error rate over 1020 job titles, but it was in the 25% range before the threshold). Is this a good use of the score, or am I just getting lucky?
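The thresholding idea described above can be sketched as a small helper over francAll-shaped output ([code, score] pairs, sorted best first, winner normalised to 1). The helper name and margin value are hypothetical:

```javascript
// Hypothetical helper: accept the top guess from a francAll-style
// result only when the runner-up is clearly behind it.
function pickConfident(results, margin = 0.1) {
  if (results.length === 0) return 'und'
  const [best, runnerUp] = results
  // If the top two scores are within `margin`, refuse to decide.
  if (runnerUp && best[1] - runnerUp[1] < margin) return 'und'
  return best[0]
}

console.log(pickConfident([['por', 1], ['glg', 0.77]])) // 'por'
console.log(pickConfident([['spa', 1], ['eng', 0.95]])) // 'und' (too close)
```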

Using franc in the browser?

How would you use franc without npm in the browser, and also how does one turn the 3 letter language code into the full name of the language as seen in the demo?
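On the second question, one dependency-free option (an assumption: a modern runtime with full ICU, such as current browsers or recent Node.js) is Intl.DisplayNames, which turns a language tag into a display name. franc's ISO 639-3 codes would first need mapping to 639-1 where possible:

```javascript
// Sketch: turn a (639-1) language code into a human-readable name.
// Assumes a runtime with full ICU data.
const names = new Intl.DisplayNames(['en'], {type: 'language'})

console.log(names.of('af')) // 'Afrikaans'
console.log(names.of('nn')) // 'Norwegian Nynorsk'
```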

Should be MIT

It would be really awesome to create a version which is MIT/BSD licensed instead of LGPL.

Wrong language detection even for simple texts

I was using franc in my project, only to discover that it detects the wrong language even for simple texts:

$ franc "Red Deer stags relaxing at sunrise"
dan

While https://github.com/dachev/node-cld and google language detects it correctly --

{ reliable: true,
languages: [ { name: 'ENGLISH', code: 'en', percent: 97, score: 1152 } ],
chunks: [ { name: 'ENGLISH', code: 'en', offset: 0, bytes: 39 } ] }

Text is from - https://www.flickr.com/photos/jellybeanzgallery/23287127571/in/explore-2015-11-28/

Almost got it right in one of the examples

franc('Alle mennesker er født frie og'); //=> 'nno'

This is actually nob (Norwegian Bokmål), not nno (Norwegian Nynorsk) :)

If you finish the sentence it gets it right.

franc('Alle mennesker er født frie og med samme menneskeverd og menneskerettigheter.') //=> 'nob'

Still, great library!

English and Chinese mixed text results in an invalid Scottish (sco) match with 100% probability

Run the following snippet:

    var
        text    = 'That man is the richest whose pleasure are the cheapest. 能处处寻求快乐的人才是最富有的人。— 梭罗',
        langs   = franc.all(text);

    console.log(langs);

The result is:

    [ [ 'sco', 1 ],
      [ 'eng', 0.9541225122770742 ],
      [ 'src', 0.7208581028689585 ],
      [ 'rmn', 0.7191780821917808 ],
      [ 'nds', 0.7121995347635048 ],
      [ 'ron', 0.6665805117601448 ],
      [ 'hat', 0.665158955802533 ],
      [ 'ita', 0.6585681054536056 ],
      [ 'als', 0.6544326699405532 ],
      [ 'fra', 0.6509433962264151 ],
      [ 'yao', 0.6367278366502973 ],
      [ 'ayr', 0.627681571465495 ],
      [ 'por', 0.6112690617730681 ],
      [ 'afr', 0.608942879296976 ],
      [ 'est', 0.6075213233393641 ],
      [ 'tzm', 0.6062289997415353 ],
      [ 'deu', 0.6039028172654433 ],
      [ 'bug', 0.6032566554665288 ],
      [ 'glg', 0.6000258464719566 ],
      [ 'nld', 0.5965365727578186 ],
      [ 'bin', 0.595890410958904 ],
      [ 'pam', 0.5922719048849832 ],
      [ 'ace', 0.5916257430860687 ],
      [ 'nso', 0.586585681054536 ],
      [ 'mad', 0.5864564486947532 ],
      [ 'nhn', 0.5861979839751874 ],
      [ 'sna', 0.5823210131817007 ],
      [ 'nno', 0.5753424657534247 ],
      [ 'run', 0.5721116567588524 ],
      [ 'cat', 0.5708193331610235 ],
      [ 'epo', 0.5692685448436288 ],
      [ 'ban', 0.569139312483846 ],
      [ 'min', 0.5682346859653657 ],
      [ 'snn', 0.5650038769707935 ],
      [ 'tiv', 0.5580253295425175 ],
      [ 'kin', 0.5569914706642543 ],
      [ 'tpi', 0.5568622383044715 ],
      [ 'tgl', 0.555052985267511 ],
      [ 'spa', 0.5547945205479452 ],
      [ 'gax', 0.553889894029465 ],
      [ 'quz', 0.5494959937968467 ],
      [ 'bci', 0.5478159731196692 ],
      [ 'war', 0.546911346601189 ],
      [ 'ibo', 0.5448436288446628 ],
      [ 'quy', 0.5403204962522616 ],
      [ 'jav', 0.5383820108555182 ],
      [ 'sot', 0.5377358490566038 ],
      [ 'tsn', 0.5373481519772552 ],
      [ 'snk', 0.5356681313000775 ],
      [ 'qug', 0.5339881106229 ],
      [ 'dip', 0.5324373223055052 ],
      [ 'dan', 0.5317911605065908 ],
      [ 'uig', 0.5306280692685448 ],
      [ 'bcl', 0.5273972602739726 ],
      [ 'ckb', 0.5252003101576634 ],
      [ 'hil', 0.5226156629620057 ],
      [ 'ilo', 0.5213233393641767 ],
      [ 'ndo', 0.5201602481261307 ],
      [ 'nya', 0.5160248126130783 ],
      [ 'tur', 0.5104678211424141 ],
      [ 'plt', 0.5089170328250194 ],
      [ 'ceb', 0.5064616179891445 ],
      [ 'aka', 0.5054277591108813 ],
      [ 'nob', 0.5045231325924011 ],
      [ 'ibb', 0.5036185060739209 ],
      [ 'emk', 0.5001292323597829 ],
      [ 'ind', 0.4957353321271647 ],
      [ 'sun', 0.4927629878521582 ],
      [ 'tem', 0.4919875936934608 ],
      [ 'ada', 0.4919875936934608 ],
      [ 'mos', 0.488239855259757 ],
      [ 'kde', 0.488239855259757 ],
      [ 'hau', 0.48216593434996124 ],
      [ 'rmy', 0.4797105195140863 ],
      [ 'hms', 0.47777203411734304 ],
      [ 'fuc', 0.4771258723184285 ],
      [ 'hun', 0.4768674075988627 ],
      [ 'ewe', 0.47389506332385634 ],
      [ 'bam', 0.47118118376841556 ],
      [ 'suk', 0.47066425432928405 ],
      [ 'uzn', 0.4685965365727578 ],
      [ 'tuk', 0.4609718273455673 ],
      [ 'lav', 0.4608425949857844 ],
      [ 'fin', 0.4605841302662187 ],
      [ 'pol', 0.4604548979064358 ],
      [ 'lit', 0.45993796846730417 ],
      [ 'som', 0.45838718014990953 ],
      [ 'xho', 0.4569656241922978 ],
      [ 'azj', 0.45463944171620574 ],
      [ 'vmw', 0.45076247092271904 ],
      [ 'bem', 0.45024554148358753 ],
      [ 'knc', 0.44339622641509435 ],
      [ 'swh', 0.44313776169552854 ],
      [ 'lin', 0.441457741018351 ],
      [ 'vie', 0.44029464978030497 ],
      [ 'ces', 0.44003618506073916 ],
      [ 'toi', 0.43874386146291033 ],
      [ 'zul', 0.4377100025846472 ],
      [ 'slk', 0.43473765830964073 ],
      [ 'ssw', 0.4340914965107263 ],
      [ 'cjk', 0.4334453347118118 ],
      [ 'gaa', 0.43254070819333157 ],
      [ 'men', 0.43228224347376587 ],
      [ 'srp', 0.4302145257172396 ],
      [ 'kbp', 0.4256913931248385 ],
      [ 'bos', 0.42401137244766085 ],
      [ 'lua', 0.4210390281726545 ],
      [ 'lun', 0.41664512794003616 ],
      [ 'hrv', 0.41250969242698377 ],
      [ 'tso', 0.40759886275523394 ],
      [ 'sag', 0.4073403980356681 ],
      [ 'slv', 0.40462651848022746 ],
      [ 'nyn', 0.40372189196174724 ],
      [ 'wol', 0.4025588007237012 ],
      [ 'fon', 0.4011372447660895 ],
      [ 'yor', 0.39622641509433965 ],
      [ 'swe', 0.3900232618247609 ],
      [ 'kng', 0.38097699663995865 ],
      [ 'umb', 0.37645386404755754 ],
      [ 'lug', 0.36495218402688034 ],
      [ 'kmb', 0.3509950891703283 ] ]

Obviously, this is invalid. Also, there is not a single occurrence of cmn in the results list.

Czech language detection is not accurate

Results from franc.all() are returning the wrong order of languages. I tried some online language-detection tools (including Google Translate); they all correctly recognize this as Czech.

já tenhle týden dopíšu, nahodím na Intercom Educate (pořídil jsem minulý týden) a bude to na první uživatele

Inaccurate detection examples

Here are just a few inaccuracies I've come across testing this package:

franc('iphone unlocked') // returns 'ibb' instead of 'eng'
franc('new refrigerator') // returns 'dan' instead of 'eng'
franc('макбук копмьютер очень хороший') // returns 'kir' instead of 'rus'

Should make data.json smaller

Every trigram has a value, which is exactly its index in the language model. That index could be generated in code.
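The proposed change can be sketched as follows, assuming each model ships as a single frequency-ordered, pipe-separated trigram string and the rank map is rebuilt at load time:

```javascript
// Sketch: rebuild a trigram -> rank map from an ordered string,
// so the indices never need to be stored in data.json.
function expand(serialized) {
  const model = {}
  serialized.split('|').forEach((trigram, index) => {
    // The rank is just the trigram's position in the ordered list.
    model[trigram] = index
  })
  return model
}

console.log(expand('de |en | de')) // {'de ': 0, 'en ': 1, ' de': 2}
```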

Problems with franc and Uzbek (uzb, uzn, uzs)

I have found that the Uzbek (my native) language is not working properly. I tested with large data sets. Can I make a contribution? Also, there is an issue with the naming convention of the language code here: 'uzn' (Northern Uzbek) has never been used in linguistics, and I wonder how it became an ISO 639 identifier.

TypeError when language not in whiteList

This fails, while it works if no whiteList is specified (the input is Hebrew):

const franc = require('franc');
var language = franc("הפיתוח הראשוני בשנות ה־80 התמקד בגנו ובמערכת הגרפית",
                     { 'whitelist': ['eng'] });
console.log(language);

The error is:

/home/dnaber/nodejs/node_modules/franc/lib/franc.js:240
var min = distances[0][1];
                      ^
TypeError: Cannot read property '1' of undefined
at normalize (/home/dnaber/nodejs/node_modules/franc/lib/franc.js:240:27)
at detectAll (/home/dnaber/nodejs/node_modules/franc/lib/franc.js:313:12)
at detect (/home/dnaber/nodejs/node_modules/franc/lib/franc.js:325:12)
at Object.<anonymous> (/home/dnaber/crea/firefox-dict-switcher/nodejs/evaluation2.js:3:16)
at Module._compile (module.js:456:26)
at Object.Module._extensions..js (module.js:474:10)
at Module.load (module.js:356:32)
at Function.Module._load (module.js:312:12)
at Function.Module.runMain (module.js:497:10)
at startup (node.js:119:16)

Performance: Exit early when high accuracy is reached

When the probability of a given document for a certain language is over X, the process could exit early and return the language.

It’s at least interesting to research, maybe by splitting the input into parts of N characters/trigrams...

This also touches on another problem: normalising probability-values. Currently, franc returns [languageCode, 1] for a very-probable language, which might be confusing for further processing.

Support ISO 639-1 codes alongside ISO 639-3

The two-letter ISO 639-1 codes are quite widely used, for example in MongoDB's text index. Are there plans for adding support for this language code variant in the future?

Mapping between the two is trivial as a post-process, but creates unnecessary clutter whereas getting the desired language code variant straight from Franc would be very neat and tidy.

Ignored env THRESHOLD

I have setup the

$ export THRESHOLD=100000

but when running the build

$ npm run build

> [email protected] build /Volumes/MacHDD2/Developmemt/Node/franc
> npm run build-bundle && npm run build-mangle && npm run build-fixtures && npm run build-support


> [email protected] build-bundle /Volumes/MacHDD2/Developmemt/Node/franc
> npm run build-bundle-small && npm run build-bundle-large && npm run build-bundle-medium


> [email protected] build-bundle-small /Volumes/MacHDD2/Developmemt/Node/franc
> export THRESHOLD=8000000 && node script/build-languages.js && browserify lib/franc.js --standalone franc --outfile franc.js

so franc is built with default settings

Franc will be created with support for languages with AT LEAST `8000000` speakers.

Install franc-cli failed

node version v7.7.0

npm -v 4.1.2

run

$ npm install franc-cli --global

npm ERR! enoent ENOENT: no such file or directory, chmod '/Users/FTAndy/.nvm/versions/node/v7.7.0/lib/node_modules/franc-cli/cli.js'
npm ERR! enoent ENOENT: no such file or directory, chmod '/Users/FTAndy/.nvm/versions/node/v7.7.0/lib/node_modules/franc-cli/cli.js'

any idea?

Option to return ISO 639-1 codes

Great library!

It would be useful in certain cases to have the option to return ISO 639-1 (i.e., the two letter language code) rather than the three-letter version. Might that be an option?

Consistency on ISO standards for easier integration.

Revisiting #10
I think it's great that you support languages not found in any of the ISO standards.

But for those that can be found, the fact that franc sometimes returns the 2T code and other times the 2B code makes it really hard to map without huge lists.

For instance:

  • arm matches 2B for Armenian but not 2T nor 3 which are 'hye'
  • ces, on the other hand, matches 2T and 3 while 2B is 'cze'

So returning one or the other inconsistently makes integration with those standards difficult.

I agree that for languages not found in the standards a solution must be found, and that is great! But for those that do match, adhering to one standard or the other would be very helpful.

Thanks, best regards,
Rafa.

Accuracy

I'm not reporting an issue at all but I want to know if I'm missing something or what.

Check this out:

franc.all('drink some coffee', { whitelist: ['eng', 'spa'] });

// outputs
[ [ 'spa', 1 ], [ 'eng', 0.949748743718593 ] ]

Where the main competitor cld from Google (the one you mentioned on the README.md) outputs the following:

cld.detect('drink some coffee', function(err, data){  
  return data;
});

// outputs
{ reliable: true,
  textBytes: 19,
  languages: [ { name: 'ENGLISH', code: 'en', percent: 94, score: 1194 } ],
  chunks: [] }

Is this franc's accuracy? Because this is far from correct.

The underlying model seems wrong to me

Hi, Could you explain a little bit this function:

function getDistance(trigrams, model) {
    var distance = 0;
    var index = trigrams.length;
    var trigram;
    var difference;

    while (index--) {
        trigram = trigrams[index];

        if (trigram[0] in model) {
            difference = trigram[1] - model[trigram[0]];

            if (difference < 0) {
                difference = -difference;
            }
        } else {
            difference = MAX_DIFFERENCE;
        }

        distance += difference;
    }

    return distance;
}

In particular, I don't get why you do difference = trigram[1] - model[trigram[0]];
Basically you are comparing the number of occurrences of a specific trigram in the input string, trigram[1], with its weight in a specific language model, model[trigram[0]]. And that, for me, doesn't make a lot of sense. Am I getting something wrong here?

For instance I tested it with the simple input "de " which contains the two trigrams "de " and " de". Based on the language models defined in data.json, the expected output should have been "spa" as those two trigrams are in 1st and 3rd positions. However the result is "por", even if these two trigrams are ranked 2nd and 3rd.

Thanks!

Adrien
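For background, the distance in the classic Cavnar and Trenkle n-gram method, which franc derives from, compares trigram ranks rather than raw counts: each side orders its trigrams by frequency, and the distance sums the rank differences. A toy sketch with illustrative ranks and an illustrative penalty value:

```javascript
// Toy rank-distance sketch (illustrative values, not franc's real
// models): each model maps a trigram to its rank in that language's
// frequency-sorted trigram list; unseen trigrams get a fixed penalty.
const MAX_DIFFERENCE = 300

function rankDistance(inputRanks, model) {
  let distance = 0
  for (const [trigram, rank] of inputRanks) {
    distance += trigram in model
      ? Math.abs(rank - model[trigram]) // rank difference, not count
      : MAX_DIFFERENCE
  }
  return distance
}

// Trigrams of "de ", ranked, compared against two toy models:
const input = [['de ', 0], [' de', 1]]
console.log(rankDistance(input, {'de ': 0, ' de': 2})) // 1 -> closer
console.log(rankDistance(input, {'de ': 1, ' de': 2})) // 2
```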

Readme with death sign

Do you really have to use † in your Readme? It's a Christian symbol and (also) a symbol for death, so it is probably not the best idea to use it there.
There are better Unicode characters... 😃

Add support for BCP 47 and output IANA language subtags

By default, Franc returns ISO-639-3 three-letter language tags, as listed in the Supported Languages table.

We would like Franc to alternatively support outputting IANA language subtags as an option, in compliance with the W3C recommendation for specifying the value of the lang attribute in HTML (and the xml:lang attribute in XML) documents.

(Two- and three-letter) IANA language codes are used as the primary language subtags in the language tag syntax as defined by the IETF’s BCP 47, which may be further specified by adding subtags for “extended language”, script, region, dialect variants, etc. (RFC 5646 describes the syntax in full). The addition of such more fine-grained secondary qualifiers are, I guess, out of Franc’s scope, but it would be very helpful nevertheless when Franc would be able to at least return the IANA primary language tags, which suffice, if used stand-alone, to be still in compliance with the spec.

On the Web — as the IETF and W3C agree — IANA language subtags and BCP 47 seem to be the de facto industry standard (at least more so than ISO 639-3). Moreover, the naming convention for TeX hyphenation pattern files (such as used by i.a. OpenOffice) use ISO-8859-2 codes, which overlap better with IANA language subtags, too.

If Franc would output IANA language subtags, then the return values could be used as-is, and without any further post-processing or re-mapping, in, for example CSS rules, specifying hyphenation:

@media print {
  :lang(nl) { hyphenate-patterns: url(hyphenation/hyph-nl.pat); }
}

@wooorm :

  1. What is the rationale for Franc to default on ISO-639-3 (only)? Is it a “better” standard, and, if so, why?
  2. If you would agree it would be a good idea for Franc to support BCP 47 and outputting IANA language subtags as an available option, then how would you prefer it to be implemented and accept a PR? (We’d happily contribute.) Would it suffice to add and map them in data/support.json?

Input Encoding

Hi,

The docs don't state whether input is expected in a particular encoding. Is any encoding OK for the input text (cp1251, utf16, iso2022jp...)? I have a huge database of text files whose encoding is unknown, and I would like to detect their language. Thanks.
