I have tried this on few strings, to find out score, all strings are clearly english a

Detection of english string does not work correctly (language-detection, CLOSED)

EuropeDev commented on July 20, 2024

Detection of english string does not work correctly

from language-detection.

Comments (8)

nunoperalta commented on July 20, 2024

Also this: "I was sure I was going to talk to"
returns:

array(4) {
    ["bi"]=> float(0.52726326742976)
    ["af"]=> float(0.52575442247659)
    ["jv"]=> float(0.51638917793965)
    ["fy"]=> float(0.51092611862643)
}

patrickschur commented on July 20, 2024

@Coxii and @nunoperalta: The library uses n-grams (310 per language) to recognize a given language, and this method doesn't work very well for short sentences like some of your examples. The limit of 310 n-grams is a compromise between speed and accuracy. If you want to improve the detection phase, you have to use more n-grams, which is also explained in the FAQ.

https://github.com/patrickschur/language-detection#how-can-i-improve-the-detection-phase
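To illustrate why short sentences are hard for this kind of method, here is a rough sketch of building a character n-gram profile. It is not the library's code: `ngramProfile()` is a hypothetical helper, and the padding with `_`, the n-gram lengths 1 to 3, and the ASCII-only tokenizing are assumptions made for brevity.

```php
<?php
// Hypothetical helper, not part of the library: return the distinct
// character n-grams (lengths 1..3) of a text, most frequent first.
// Assumes ASCII input to keep the sketch short.
function ngramProfile(string $text, int $maxNgrams): array
{
    $counts = [];
    // Lowercase the text and split it into words on anything that is not a-z.
    $words = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        // Pad each word so that word boundaries show up in the n-grams.
        $padded = '_' . $word . '_';
        $len = strlen($padded);
        for ($n = 1; $n <= 3; $n++) {
            for ($i = 0; $i + $n <= $len; $i++) {
                $gram = substr($padded, $i, $n);
                $counts[$gram] = ($counts[$gram] ?? 0) + 1;
            }
        }
    }
    unset($counts['_']);   // a lone pad character carries no information
    arsort($counts);       // most frequent n-grams first
    return array_slice(array_keys($counts), 0, $maxNgrams);
}

$profile = ngramProfile('I was sure I was going to talk to', 310);
echo count($profile), " distinct n-grams\n";
```

A nine-word sentence yields far fewer distinct n-grams than the 310 slots in a language profile, and many of them (such as `to` or `as`) are plausible in several languages, so the scores for different languages end up close together.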

@nunoperalta: Here is my result for using 9000 ngrams and your sentence:

Array
(
    [en] => 0.86721031746032
    [af] => 0.83361507936508
    [wo] => 0.8332003968254
    [nl] => 0.82750992063492
)

As you can see, the results are much better now. They are still not perfect, but that is the problem with combining n-grams and short sentences: many of the n-grams can also be found in other language profiles.
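One way to act on such close scores is to check the gap between the two best candidates before trusting the top one. The sketch below uses only plain PHP arrays and the scores quoted above. `topMargin()` and the 0.05 threshold are hypothetical and chosen only for illustration; they are not part of the library.

```php
<?php
// Hypothetical helper: gap between the highest and second-highest score.
function topMargin(array $scores): float
{
    if (count($scores) < 2) {
        throw new InvalidArgumentException('Need at least two scores.');
    }
    arsort($scores);                  // highest score first, keys preserved
    $values = array_values($scores);
    return $values[0] - $values[1];
}

// Scores quoted in this thread for "I was sure I was going to talk to".
$scores = [
    'en' => 0.86721031746032,
    'af' => 0.83361507936508,
    'wo' => 0.8332003968254,
    'nl' => 0.82750992063492,
];

$threshold = 0.05; // assumed cut-off, for illustration only
echo topMargin($scores) < $threshold ? "ambiguous\n" : "confident\n";
```

With these numbers the gap between `en` and `af` is about 0.034, so the sketch prints `ambiguous`. A caller could then ask for more text instead of accepting the first result.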

@Coxii: Your results look very strange. Normally the score for each language should be between 0 and 1. Which version of the library are you using, and which PHP version?

EuropeDev commented on July 20, 2024

Yes, because of this problem I ran your script on several sentences, took the score from each, and subtracted them, to give your script more words. But no, the result was still bad. I will try the train feature.

nunoperalta commented on July 20, 2024

@patrickschur

Here is my result for using 9000 ngrams and your sentence:

Thank you very much.

However, I did try that, with 9000, but the results weren't good either. Not sure how you're getting "en" as your first result.
Unless I missed something in the installation of this.

use LanguageDetection\Language;

$ld = new Language();
$ld->setMaxNgrams(9000);
var_dump($ld->detect('I was sure I was going to talk to')->close());

["af"]=> float(0.65620634920635)
["jv"]=> float(0.65609722222222)
["bi"]=> float(0.63963492063492)
["fy"]=> float(0.63835119047619)
["tl"]=> float(0.63735317460317)
["en"]=> float(0.62078174603175)

nunoperalta commented on July 20, 2024

@patrickschur - could it be because the trained files aren't the most up-to-date in the latest package released?

nunoperalta commented on July 20, 2024

Other examples

Do you ever feel misunderstood?

[es] => 0.48239631336406
[en] => 0.47506912442396
...

How would you describe your personality?

[es] => 0.4746110056926
[ia] => 0.4658064516129
[en] => 0.46550284629981
...

What do u consider a stupid question? Examples..

[oc] => 0.50187623436471
[fr] => 0.49858459512837
[la] => 0.49614878209348
[ca] => 0.49239631336406
[ia] => 0.48844634628045
[es] => 0.47235023041475
[gl] => 0.47146148782093
[pt-BR] => 0.46919025674786
[pt-PT] => 0.462409479921
[en] => 0.43433179723502

patrickschur commented on July 20, 2024

@nunoperalta Which version do you use? I tried to reproduce your issue but I get completely different results. I installed the library (v5.1.0) via composer and tried some of your examples and I get the following results.

Do you ever feel misunderstood?

[en] => 0.81752...
[nl] => 0.80134...

How would you describe your personality?

[en] => 0.85428...
[fr] => 0.73331...

Here is the script I used:

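use LanguageDetection\Language;
use LanguageDetection\Trainer;

// Must be executed once and can then be removed.
// Regenerates the language profiles with 9000 n-grams each.
$t = new Trainer();
$t->setMaxNgrams(9000);
$t->learn();

// The detector must use the same n-gram limit as the trained profiles.
$ld = new Language();
$ld->setMaxNgrams(9000);

var_dump($ld->detect('Do you ever feel misunderstood?'));
var_dump($ld->detect('How would you describe your personality?'));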

nunoperalta commented on July 20, 2024

I was using whatever the latest version from Composer was, with the code I provided.

Unfortunately, I really needed something accurate, so I went looking for alternatives.
I'm now using https://github.com/JohnSnowLabs/spark-nlp and it has been working well and accurately.

Best of luck - have a great Christmas holiday and a better new Year 🎄👍
