Comments (8)

nunoperalta commented on July 20, 2024

Also this: "I was sure I was going to talk to"
returns:

array(4) {
    ["bi"]=> float(0.52726326742976)
    ["af"]=> float(0.52575442247659)
    ["jv"]=> float(0.51638917793965)
    ["fy"]=> float(0.51092611862643)
}

from language-detection.

patrickschur commented on July 20, 2024

@Coxii and @nunoperalta: The library uses N-grams (310 per language) to recognize a given language, and this method doesn't work very well for short sentences like some of your examples. The limit of 310 ngrams is a compromise between speed and accuracy. If you want to improve detection, you have to use more ngrams, as explained in the FAQ:

https://github.com/patrickschur/language-detection#how-can-i-improve-the-detection-phase
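
To illustrate why short sentences are hard for this approach, here is a minimal sketch of character n-gram extraction. It is not the library's tokenizer: the helper `topNgrams()`, the `_` word-boundary padding, and the n-gram lengths 1 to 3 are assumptions made for this example, and the code assumes the mbstring extension is available.

```php
<?php
// Hypothetical helper, NOT part of patrickschur/language-detection.
// Counts character n-grams (lengths 1 to $maxLen) inside each word and
// returns the $limit most frequent ones, most frequent first.
function topNgrams(string $text, int $limit = 310, int $maxLen = 3): array
{
    $counts = [];
    // Split on anything that is not a letter; the "u" flag enables UTF-8.
    $words = preg_split('/[^\pL]+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        $padded = '_' . $word . '_';   // mark word boundaries (assumed convention)
        $length = mb_strlen($padded);
        for ($n = 1; $n <= $maxLen; $n++) {
            for ($i = 0; $i + $n <= $length; $i++) {
                $gram = mb_substr($padded, $i, $n);
                $counts[$gram] = ($counts[$gram] ?? 0) + 1;
            }
        }
    }
    arsort($counts);                   // sort by count, descending
    return array_slice(array_keys($counts), 0, $limit);
}

$short = topNgrams('I was sure I was going to talk to');
echo count($short), " distinct n-grams in the short sentence\n";
```

The example sentence yields well under 310 distinct n-grams, so there is little evidence to rank against each language profile.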

@nunoperalta: Here is my result for using 9000 ngrams and your sentence:

Array
(
    [en] => 0.86721031746032
    [af] => 0.83361507936508
    [wo] => 0.8332003968254
    [nl] => 0.82750992063492
)

As you can see, the results are much better now. They are still not perfect, but that is the inherent problem of using ngrams on short sentences: many of the ngrams can also be found in other language profiles.

@Coxii: Your results are looking very strange. Normally the result for each language should be between 0 and 1. Which version of the library are you using and which PHP version?
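
The overlap between profiles, and the 0 to 1 score range, can be sketched with a ranking comparison in the spirit of the Cavnar-Trenkle "out-of-place" measure. This is an assumption about the general technique, not the library's code: `similarity()` is a hypothetical helper, and both profiles below are invented for illustration only.

```php
<?php
// Hypothetical scoring helper, NOT the library's implementation.
// $sample and $profile are lists of n-grams ordered most frequent first.
function similarity(array $sample, array $profile): float
{
    if ($sample === [] || $profile === []) {
        return 0.0;
    }
    $rankInProfile = array_flip($profile);   // n-gram => rank in the profile
    $maxPenalty = count($profile);           // penalty for a missing n-gram
    $distance = 0;
    foreach ($sample as $rank => $gram) {
        $distance += isset($rankInProfile[$gram])
            ? min(abs($rank - $rankInProfile[$gram]), $maxPenalty)
            : $maxPenalty;
    }
    // Normalise to 0..1, where 1 means an identical ranking.
    return 1 - $distance / ($maxPenalty * count($sample));
}

// Toy profiles, invented for this example (not real language data).
$english   = ['_', 'e', 't', 'a', 'th', 'he', '_t', 'wa', 'as', 's_'];
$afrikaans = ['_', 'e', 'a', 'i', 'ie', 'e_', 'di', 'wa', 'as', 's_'];

// Ranked n-grams of a very short text.
$sample = ['_', 'a', 'wa', 'as', 's_'];

printf("en: %.3f\naf: %.3f\n", similarity($sample, $english), similarity($sample, $afrikaans));
```

With these invented profiles the sample scores 0.660 against English and 0.680 against Afrikaans: every n-gram of the short text occurs in both profiles, so the scores end up close together and the order can flip.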

EuropeDev commented on July 20, 2024

@patrickschur wrote: "... this method doesn't work very well for short sentences like in some of your examples. ... @Coxii: Your results are looking very strange. Normally the result for each language should be between 0 and 1. Which version of the library are you using and which PHP version?"

Yes, because of this problem I ran your script on several sentences, took the score from each, and subtracted them, to give your script more words. But no, the result was still bad. I will try the training feature.

nunoperalta commented on July 20, 2024

@patrickschur

"Here is my result for using 9000 ngrams and your sentence:"

Thank you very much.

However, I did try that, with 9000, but the results weren't good either. Not sure how you're getting "en" as your first result.
Unless I missed something in the installation of this.

use LanguageDetection\Language;

// Raise the ngram limit from the default 310 to 9000.
$ld = new Language();
$ld->setMaxNgrams(9000);
var_dump($ld->detect('I was sure I was going to talk to')->close());

["af"]=> float(0.65620634920635)
["jv"]=> float(0.65609722222222)
["bi"]=> float(0.63963492063492)
["fy"]=> float(0.63835119047619)
["tl"]=> float(0.63735317460317)
["en"]=> float(0.62078174603175)

nunoperalta commented on July 20, 2024

@patrickschur - could it be because the trained files aren't the most up-to-date in the latest package released?

nunoperalta commented on July 20, 2024

Other examples

Do you ever feel misunderstood?

[es] => 0.48239631336406
[en] => 0.47506912442396
...

How would you describe your personality?

[es] => 0.4746110056926
[ia] => 0.4658064516129
[en] => 0.46550284629981
...

What do u consider a stupid question? Examples..

[oc] => 0.50187623436471
[fr] => 0.49858459512837
[la] => 0.49614878209348
[ca] => 0.49239631336406
[ia] => 0.48844634628045
[es] => 0.47235023041475
[gl] => 0.47146148782093
[pt-BR] => 0.46919025674786
[pt-PT] => 0.462409479921
[en] => 0.43433179723502

patrickschur commented on July 20, 2024

@nunoperalta Which version are you using? I tried to reproduce your issue, but I get completely different results. I installed the library (v5.1.0) via Composer, tried some of your examples, and got the following results.

Do you ever feel misunderstood?

[en] => 0.81752...
[nl] => 0.80134...

How would you describe your personality?

[en] => 0.85428...
[fr] => 0.73331...

Here is the script I used:

use LanguageDetection\Language;
use LanguageDetection\Trainer;

// Must be executed once and can then be removed.
$t = new Trainer();
$t->setMaxNgrams(9000);
$t->learn();

$ld = new Language();
$ld->setMaxNgrams(9000);

var_dump($ld->detect('Do you ever feel misunderstood?'));
var_dump($ld->detect('How would you describe your personality?'));

nunoperalta commented on July 20, 2024

I was using whatever the latest version from Composer was, with the code I provided.

Unfortunately, I really needed something accurate, so I went to find alternatives.
I'm now using https://github.com/JohnSnowLabs/spark-nlp and it has been working well and accurately.

Best of luck - have a great Christmas holiday and a better new Year 🎄👍
