Comments (8)
Also this: "I was sure I was going to talk to"
returns:
array(4) {
["bi"]=> float(0.52726326742976)
["af"]=> float(0.52575442247659)
["jv"]=> float(0.51638917793965)
["fy"]=> float(0.51092611862643)
}
from language-detection.
@Coxii and @nunoperalta: The library uses n-grams (310 per language by default) to recognize a given language, and this method doesn't work very well for short sentences like some of your examples. The default of 310 n-grams is a compromise between speed and accuracy. If you want to improve detection, you have to use more n-grams, which is also explained in the FAQ.
https://github.com/patrickschur/language-detection#how-can-i-improve-the-detection-phase
@nunoperalta: Here is my result for using 9000 ngrams and your sentence:
Array
(
[en] => 0.86721031746032
[af] => 0.83361507936508
[wo] => 0.8332003968254
[nl] => 0.82750992063492
)
As you can see, the results are much better now. They are still not perfect, but that is a limitation of using n-grams on short sentences: many of the n-grams can also be found in other language profiles.
@Coxii: Your results look very strange. Normally the score for each language should be between 0 and 1. Which version of the library are you using, and which PHP version?
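To illustrate why short sentences are hard for this approach, here is a minimal sketch of character n-gram extraction. It is not the library's actual tokenizer: the function name `charNgrams`, the `_` word padding, and the 1 to 3 character range are assumptions made for illustration, and the code is byte-based, so it only suits ASCII input like the example sentence.

```php
<?php
// Illustrative only: a hypothetical helper, not the library's implementation.
// Counts overlapping character n-grams per word, padding each word with "_".
function charNgrams(string $text, int $min = 1, int $max = 3): array
{
    $counts = [];
    // Lowercase and split on anything that is not an ASCII letter.
    $words = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        $padded = '_' . $word . '_';
        $len = strlen($padded);
        for ($n = $min; $n <= $max; $n++) {
            for ($i = 0; $i + $n <= $len; $i++) {
                $gram = substr($padded, $i, $n);
                if ($gram === '_') {
                    continue; // the padding character alone carries no information
                }
                $counts[$gram] = ($counts[$gram] ?? 0) + 1;
            }
        }
    }
    arsort($counts); // most frequent n-grams first
    return $counts;
}

$ngrams = charNgrams('I was sure I was going to talk to');
printf("%d distinct n-grams\n", count($ngrams));
```

A nine-word sentence yields only a few dozen distinct n-grams, and short ones such as `_t` or `to` are common in many languages, so several language profiles can end up with nearly identical scores.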
Yes, because of this problem I ran your script on several sentences, took the score from each, and subtracted them, to give your script more words. But no, the result was still bad. I will try the training feature.
Here is my result for using 9000 ngrams and your sentence:
Thank you very much.
However, I did try that, with 9000, but the results weren't good either. Not sure how you're getting "en" as your first result.
Unless I missed something in the installation of this.
use LanguageDetection\Language;

$ld = new Language();
$ld->setMaxNgrams(9000);
var_dump($ld->detect('I was sure I was going to talk to')->close());
["af"]=> float(0.65620634920635)
["jv"]=> float(0.65609722222222)
["bi"]=> float(0.63963492063492)
["fy"]=> float(0.63835119047619)
["tl"]=> float(0.63735317460317)
["en"]=> float(0.62078174603175)
@patrickschur - could it be because the trained files aren't the most up-to-date in the latest package released?
Other examples
Do you ever feel misunderstood?
[es] => 0.48239631336406
[en] => 0.47506912442396
...
How would you describe your personality?
[es] => 0.4746110056926
[ia] => 0.4658064516129
[en] => 0.46550284629981
...
What do u consider a stupid question? Examples..
[oc] => 0.50187623436471
[fr] => 0.49858459512837
[la] => 0.49614878209348
[ca] => 0.49239631336406
[ia] => 0.48844634628045
[es] => 0.47235023041475
[gl] => 0.47146148782093
[pt-BR] => 0.46919025674786
[pt-PT] => 0.462409479921
[en] => 0.43433179723502
@nunoperalta Which version are you using? I tried to reproduce your issue, but I get completely different results. I installed the library (v5.1.0) via Composer, tried some of your examples, and got the following results.
Do you ever feel misunderstood?
[en] => 0.81752...
[nl] => 0.80134...
How would you describe your personality?
[en] => 0.85428...
[fr] => 0.73331...
Here is the script I used:
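use LanguageDetection\Language;
use LanguageDetection\Trainer;

// Regenerates the language profiles with 9000 n-grams.
// Must be executed once and can then be removed.
$t = new Trainer();
$t->setMaxNgrams(9000);
$t->learn();

// The detector must use the same n-gram limit as the trained profiles.
$ld = new Language();
$ld->setMaxNgrams(9000);
var_dump($ld->detect('Do you ever feel misunderstood?'));
var_dump($ld->detect('How would you describe your personality?'));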
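Since the top scores for short sentences can sit very close together, one option is to accept a result only when the best language clearly beats the runner-up. The following is a rough sketch, not part of the library: `confidentLanguage` and the 0.05 margin are hypothetical, and the input is a plain array of scores shaped like the results above (truncated to the digits shown).

```php
<?php
// Hypothetical helper, not part of the library.
// Returns the top language code only if it beats the runner-up by $minMargin.
function confidentLanguage(array $scores, float $minMargin = 0.05): ?string
{
    if ($scores === []) {
        return null;
    }
    arsort($scores); // highest score first, keys preserved
    $codes = array_keys($scores);
    $values = array_values($scores);
    if (count($values) > 1 && ($values[0] - $values[1]) < $minMargin) {
        return null; // too close to call
    }
    return (string) $codes[0];
}

// "Do you ever feel misunderstood?": en and nl are about 0.016 apart.
var_dump(confidentLanguage(['en' => 0.81752, 'nl' => 0.80134])); // NULL
// "How would you describe your personality?": en leads fr by about 0.121.
var_dump(confidentLanguage(['en' => 0.85428, 'fr' => 0.73331])); // string(2) "en"
```

The margin is an arbitrary starting point and would need tuning against real data for a given n-gram setting.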
I was using whatever the latest version from Composer was, with the code I provided.
Unfortunately, I really needed something accurate, so I went looking for alternatives.
I'm now using https://github.com/JohnSnowLabs/spark-nlp, and it has been working pretty well and accurately.
Best of luck. Have a great Christmas holiday and a better New Year 🎄👍