Comments (3)
Purely statistical approaches to language detection are never 100% correct. Look at the confidence values: Russian is only slightly behind Macedonian. Based on the training data I've used, some of the letter sequences are slightly more likely to occur in Macedonian than in Russian.
Language.MACEDONIAN: 0.2627280495188072
Language.RUSSIAN: 0.25885698169328053
Language.SERBIAN: 0.2296931907029266
Language.BULGARIAN: 0.14850396414264333
Language.BELARUSIAN: 0.04966736018194442
Language.UKRAINIAN: 0.023019779852873307
Language.MONGOLIAN: 0.015713463654129702
Language.KAZAKH: 0.011817210253394902
Feed longer strings into the detector and you will get more reliable results. An interesting approach to solving this problem has been proposed in #101. I will investigate whether changing the probabilities in the way described there produces more accurate results.
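To make "only slightly behind" concrete, here is a small sketch that ranks the values listed above and measures the gap between the top two candidates. The values are copied from this comment. The `AMBIGUITY_THRESHOLD` cutoff is an arbitrary assumption for illustration, not something the library defines.

```python
# Confidence values copied from the comment above (keys shortened to plain names).
confidence_values = {
    "MACEDONIAN": 0.2627280495188072,
    "RUSSIAN": 0.25885698169328053,
    "SERBIAN": 0.2296931907029266,
    "BULGARIAN": 0.14850396414264333,
    "BELARUSIAN": 0.04966736018194442,
    "UKRAINIAN": 0.023019779852873307,
    "MONGOLIAN": 0.015713463654129702,
    "KAZAKH": 0.011817210253394902,
}

# Rank candidates from most to least likely.
ranked = sorted(confidence_values.items(), key=lambda kv: kv[1], reverse=True)
(best_lang, best), (runner_up_lang, runner_up) = ranked[0], ranked[1]

# Gap between the winner and the runner-up.
margin = best - runner_up

# Hypothetical cutoff: treat the result as unreliable if the gap is this small.
AMBIGUITY_THRESHOLD = 0.05
is_ambiguous = margin < AMBIGUITY_THRESHOLD

print(f"{best_lang} leads {runner_up_lang} by {margin:.4f}; ambiguous: {is_ambiguous}")
# prints: MACEDONIAN leads RUSSIAN by 0.0039; ambiguous: True
```

A check like this could be used to fall back to "unknown" or to ask for more text when the top two candidates are nearly tied.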
from lingua-py.
Thanks @pemistahl. The issue for me was that this specific string does not appear to be grammatically correct Macedonian. I don't understand the algorithm well enough to say why this is happening, though.
Based on the training data I've used, the letter sequences in the text "как дела" are slightly more likely to occur in Macedonian than in Russian. The library does not know anything about semantics, i.e. the meaning of the words. It's all about statistics, i.e. probabilities for certain letter sequences, also called n-grams.
I've briefly explained the algorithm in section 5 of the readme.
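As a rough illustration of what "probabilities for certain letter sequences" means, here is a toy character-bigram scorer. This is not lingua's implementation: the two tiny training texts, the add-one smoothing, the `vocab_size` value, and all function names are assumptions made up for this sketch.

```python
import math
from collections import Counter


def ngrams(text, n=2):
    """Return all overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(corpus, n=2):
    """Count n-grams in a training text; returns (counts, total)."""
    counts = Counter(ngrams(corpus, n))
    return counts, sum(counts.values())


def log_score(text, model, n=2, vocab_size=1000):
    """Sum of log-probabilities of the text's n-grams, with add-one smoothing."""
    counts, total = model
    return sum(
        math.log((counts[gram] + 1) / (total + vocab_size))
        for gram in ngrams(text, n)
    )


def confidences(text, models, n=2):
    """Turn per-language log scores into values that sum to 1, best first."""
    logs = {lang: log_score(text, model, n) for lang, model in models.items()}
    top = max(logs.values())
    weights = {lang: math.exp(score - top) for lang, score in logs.items()}
    norm = sum(weights.values())
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return {lang: weight / norm for lang, weight in ranked}


# Toy training data, invented for this sketch; real models use far larger corpora.
models = {
    "english": train("the quick brown fox jumps over the lazy dog and the cat sat on the mat"),
    "german": train("der schnelle braune fuchs springt über den faulen hund und die katze sitzt auf der matte"),
}

result = confidences("the cat", models)
for lang, value in result.items():
    print(f"{lang}: {value}")
```

The scorer never looks at word meaning. It only asks how often each letter pair was seen in each language's training text, which is why a short string made of letter pairs common to several related languages can come out nearly tied.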