A language detection library for PHP. Detects the language from a given text string.

License: MIT License


language-detection


This library can detect the language of a given text string. It can parse given training text in many different idioms into a sequence of N-grams and builds a database file in PHP to be used in the detection phase. Then it can take a given text and detect its language using the database previously generated in the training phase. The library comes with text samples used for training and detecting text in 110 languages.
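To make the idea of an N-gram profile concrete, here is a minimal sketch of how a ranked profile could be built from a piece of training text. It is not the library's implementation: the function name `buildNgramProfile`, the `_` padding around words and the splitting pattern are assumptions chosen for illustration. Only the default of 310 n-grams is taken from the FAQ in this document.

```php
<?php
declare(strict_types=1);

/**
 * Illustrative sketch only (hypothetical helper, not the library's code):
 * builds a list of character n-grams ordered from most to least frequent.
 */
function buildNgramProfile(string $text, int $minLen = 1, int $maxLen = 3, int $maxNgrams = 310): array
{
    $counts = [];

    // Lowercase the text and split it on anything that is not a letter.
    $words = preg_split('/[^\pL]+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    foreach ($words as $word) {
        // Assumption of this sketch: mark word boundaries with "_".
        $padded = '_' . $word . '_';
        $length = mb_strlen($padded);

        for ($n = $minLen; $n <= $maxLen; $n++) {
            for ($i = 0; $i + $n <= $length; $i++) {
                $gram = mb_substr($padded, $i, $n);
                $counts[$gram] = ($counts[$gram] ?? 0) + 1;
            }
        }
    }

    // Most frequent n-grams first; keep only the top $maxNgrams.
    arsort($counts);

    return array_slice(array_keys($counts), 0, $maxNgrams);
}

$profile = buildNgramProfile('banana', 1, 2, 5);
print_r($profile);
```

The position of an n-gram in the returned list is its rank; a detector can then compare the ranks found in an input text with the ranks stored for each language.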

Installation with Composer

Note: This library requires the Multibyte String extension in order to work.

$ composer require patrickschur/language-detection

How to upgrade from 3.y.z to 4.y.z?

Important: Only for people who are using a custom directory with their own translation files.

Starting with version 4.y.z we have updated the resource files. For performance reasons we now use PHP instead of JSON as a format. That means people who want to use 4.y.z and used 3.y.z before, have to upgrade their JSON files to PHP. To upgrade your resource files you must generate a language profile again. The JSON files are then no longer needed.

You can delete unnecessary JSON files under Linux with the following command.

rm resources/*/*.json

Basic Usage

To detect the language correctly, the input text should be at least a few sentences long.

use LanguageDetection\Language;
 
$ld = new Language;
 
$ld->detect('Mag het een onsje meer zijn?')->close();

Result:

Array
(
    "nl" => 0.66193548387097,
    "af" => 0.51338709677419,
    "br" => 0.49634408602151,
    "nb" => 0.48849462365591,
    "nn" => 0.48741935483871,
    "fy" => 0.47822580645161,
    "dk" => 0.47172043010753,
    "sv" => 0.46408602150538,
    "bi" => 0.46021505376344,
    "de" => 0.45903225806452,
    [...]
)

API

__construct(array $result = [], string $dirname = '')

You can pass an array of languages to the constructor so that the sentence is compared only with the given languages. This can dramatically improve performance. The second parameter is optional and is the name of the directory where the translation files are located.

$ld = new Language(['de', 'en', 'nl']);
 
// Compares the sentence only with "de", "en" and "nl" language models.
$ld->detect('Das ist ein Test');

whitelist(string ...$whitelist)

Provide a whitelist. Returns only the languages given in the list.

$ld->detect('Mag het een onsje meer zijn?')->whitelist('de', 'nn', 'nl', 'af')->close();

Result:

Array
(
    "nl" => 0.66193548387097,
    "af" => 0.51338709677419,
    "nn" => 0.48741935483871,
    "de" => 0.45903225806452
)

blacklist(string ...$blacklist)

Provide a blacklist. Removes the given languages from the result.

$ld->detect('Mag het een onsje meer zijn?')->blacklist('dk', 'nb', 'de')->close();

Result:

Array
(
    "nl" => 0.66193548387097,
    "af" => 0.51338709677419,
    "br" => 0.49634408602151,
    "nn" => 0.48741935483871,
    "fy" => 0.47822580645161,
    "sv" => 0.46408602150538,
    "bi" => 0.46021505376344,
    [...]
)

bestResults()

Returns the best results.

$ld->detect('Mag het een onsje meer zijn?')->bestResults()->close();

Result:

Array
(
    "nl" => 0.66193548387097
)

limit(int $offset, int $length = null)

You can specify the number of records to return. For example the following code will return the top three entries.

$ld->detect('Mag het een onsje meer zijn?')->limit(0, 3)->close();

Result:

Array
(
    "nl" => 0.66193548387097,
    "af" => 0.51338709677419,
    "br" => 0.49634408602151
)

close()

Returns the result as an array.

$ld->detect('This is an example!')->close();

Result:

Array
(
    "en" => 0.5889400921659,
    "gd" => 0.55691244239631,
    "ga" => 0.55376344086022,
    "et" => 0.48294930875576,
    "af" => 0.48218125960061,
    [...]
)

setTokenizer(TokenizerInterface $tokenizer)

The script uses a tokenizer to get all words in a sentence. You can define your own tokenizer, for example to deal with numbers.

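use LanguageDetection\Tokenizer\TokenizerInterface;

$ld->setTokenizer(new class implements TokenizerInterface
{
    public function tokenize(string $str): array
    {
        // Split on every character that is not a lowercase letter or a digit.
        return preg_split('/[^a-z0-9]/u', $str, -1, PREG_SPLIT_NO_EMPTY);
    }
});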

This tokenizer returns only tokens made of lowercase letters from a to z and digits from 0 to 9; every other character is treated as a separator.
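The behaviour of that pattern can be checked on its own, without the library. The sketch below wraps the same `preg_split` call in a standalone function (the name `tokenizeAlnum` is made up for this example) and shows one consequence worth knowing: uppercase letters are not in `a-z`, so they act as separators unless the input is lowercased first. Whether the library lowercases text before calling the tokenizer is not stated here, so treat the explicit `mb_strtolower` call as an assumption.

```php
<?php
declare(strict_types=1);

// Same pattern as in the custom tokenizer: every character outside
// a-z and 0-9 is a separator, and empty tokens are dropped.
function tokenizeAlnum(string $str): array
{
    return preg_split('/[^a-z0-9]/u', $str, -1, PREG_SPLIT_NO_EMPTY);
}

print_r(tokenizeAlnum('room 42 is free'));         // room, 42, is, free
print_r(tokenizeAlnum('Room 42'));                 // oom, 42 (the capital R is a separator)
print_r(tokenizeAlnum(mb_strtolower('Room 42')));  // room, 42
```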


__toString()

Returns the top entry of the result. Note the echo at the beginning.

echo $ld->detect('Das ist ein Test.');

Result:

de

jsonSerialize()

Serializes the data to JSON.

$object = $ld->detect('Tere tulemast tagasi! Nägemist!');
 
echo json_encode($object, JSON_PRETTY_PRINT);

Result:

{
    "et": 0.5224748810153358,
    "ch": 0.45817028027498674,
    "bi": 0.4452670544685352,
    "fi": 0.440983606557377,
    "lt": 0.4382866208355367,
    [...]
}

Method chaining

You can also combine methods with each other. The following example removes all entries specified in the blacklist and returns only the top four entries.

$ld->detect('Mag het een onsje meer zijn?')->blacklist('af', 'dk', 'sv')->limit(0, 4)->close();

Result:

Array
(
    "nl" => 0.66193548387097,
    "br" => 0.49634408602151,
    "nb" => 0.48849462365591,
    "nn" => 0.48741935483871
)

ArrayAccess

You can also access the object directly as an array.

$object = $ld->detect('Das ist ein Test');
 
echo $object['de'];
echo $object['en'];
echo $object['xy']; // does not exist

Result:

0.6623339658444
0.56859582542694
NULL

Supported languages

The library currently supports 110 languages. For an overview of all supported languages, please have a look here.


Other languages

The library is trainable, which means you can change, remove and add your own language files. If your language is not supported, feel free to add your own language files. To do that, create a new directory in resources and add your training text to it.

Note: The training text should be a .txt file.

Example

|- resources
    |- ham
        |- ham.txt
    |- spam
        |- spam.txt

As you can see, we can also use it to detect spam or ham.

If you store your translation files outside of resources, you have to specify the path.

$t->learn('YOUR_PATH_HERE');

Whenever you change one of the translation files you must first generate a language profile for it. This may take a few seconds.

use LanguageDetection\Trainer;
 
$t = new Trainer();
 
$t->learn();

Remove these lines after running them once. You can then classify texts by language using your own training text.


FAQ

How can I improve the detection phase?

To improve the detection phase you have to use more n-grams. Be careful, though: this will slow down the script. I found that detection is much better with around 9,000 n-grams (the default is 310). To do that, look at the code below:

$t = new Trainer();
 
$t->setMaxNgrams(9000);
 
$t->learn();

First you have to train it. Now you can classify texts like before but you must specify how many n-grams you want to use.

$ld = new Language();
 
$ld->setMaxNgrams(9000);
  
// "grille pain" is French and means "toaster" in English
var_dump($ld->detect('grille pain')->bestResults());

Result:

class LanguageDetection\LanguageResult#5 (1) {
  private $result =>
  array(2) {
    'fr' =>
    double(0.91307037037037)
    'en' =>
    double(0.90623333333333)
  }
}

Is the detection process slower if language files are very big?

No, it is not. The trainer class only uses the best 310 n-grams of each language. As long as you don't change this number or add more language files, performance is not affected. Only creating the n-grams is slower, and that has to be done only once. The detection phase is only affected when you try to detect large chunks of text.

Summary: The training phase will be slower but the detection phase remains the same.

Contributing

Feel free to contribute. Any help is welcome.

License

This project is licensed under the terms of the MIT license.

language-detection's People

Contributors

arsonik, dayvsonsales, drowe-wayfair, gradzio, iquito, joycebabu, matthewnessworthy, mejans, patrickschur, pierstoval, stof, toflar, tomasliubinas


language-detection's Issues

Japanese is not detected correctly

$ld = new Language(['ja']);
$ld->setMaxNgrams(9000);

var_dump($ld->detect('タイトティーンアクション'));

All results are null, but the language is Japanese.

TypeError: Return value of LanguageDetection\LanguageResult::__toString() must be of the type string, null returned

Error caused in https://github.com/patrickschur/language-detection/blob/master/src/LanguageDetection/LanguageResult.php#L86

    /**
     * @return string
     */
    public function __toString(): string
    {
        return key($this->result);
    }

The function key() returns null for an empty array.

An example case is when the input is a number, e.g. echo (string) $ld->detect('1992');

I will send a pull request if you need one.
Thank you.
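The failure described in this issue can be reproduced and avoided in plain PHP, independent of the library. The sketch below is a hypothetical helper (the name `topLanguage` is made up for this example, and it is not the fix that was applied upstream): it returns the first key of a score array and falls back to an empty string when the array is empty, so a `string` return type is always satisfied.

```php
<?php
declare(strict_types=1);

// Hypothetical helper, not part of the library: returns the first key of a
// score array as a string, or an empty string when there are no results.
function topLanguage(array $result): string
{
    $key = key($result); // null when the array is empty

    return $key === null ? '' : (string) $key;
}

var_dump(topLanguage(['de' => 0.6623339658444, 'en' => 0.56859582542694])); // string(2) "de"
var_dump(topLanguage([]));                                                  // string(0) ""
```

Returning an empty string is only one possible design; throwing a dedicated exception would make the "nothing detected" case harder to overlook.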

English text recognition

Hello,

This text is in English but I get this as a result.

xray no change + moderate non irradating pain nxray no change + dull pain nxray no change + biting

{ "ug-Latn": 0.422791679228218253427939998800866305828094482421875, "ch": 0.42007838408200182112040010906639508903026580810546875, "en": 0.412933373530298464260113178170286118984222412109375, "tl": 0.40669279469400054782823872301378287374973297119140625 }

Script states that the text is in Chinese before English.
What is the reason of this?

Grille pain

Guten Tag,

I continue in english ;)

I compared all current language detectors based on sequence of N-grams and your solution is the best implementation.

However, I have to translate very short sequences of words. For example, the words 'Grille pain', which means toaster in English, return

array(6) {
  ["it"]=>
  float(0.54711111111111)
  ["fr"]=>
  float(0.54633333333333)
  ["en"]=>
  float(0.506)
  ["de"]=>
  float(0.49488888888889)
  ["nl"]=>
  float(0.49)
  ["es"]=>
  float(0.43466666666667)
}

It's almost good! it wins on fr by a difference 0.000777.

So, I added words 'Grille pain' in fr file, set trainer and new result gives:

array(6) {
  ["fr"]=>
  float(0.54944444444444)
  ["it"]=>
  float(0.54711111111111)
  ["en"]=>
  float(0.506)
  ["de"]=>
  float(0.49488888888889)
  ["nl"]=>
  float(0.49)
  ["es"]=>
  float(0.43466666666667)
}

It's good. fr wins on it by a difference of 0.002333.

So, my questions:

1- Can I populate the language files so that I can be sure the first result wins by a very significant margin?

2- If yes to the previous question, is the detection process slower if the language files are very big?

3- I can see that, in the fr file, you have put the French declaration of rights. I don't think that a 200-year-old text represents current French very well. Is there some data somewhere which may be more accurate?

Thanks for your great job.

Best Regards
Michel

Possible improvement with spanish vs french

Hi,

I started using this library and it works great most of the time but I came across a problem with the following sentence in spanish: "Este es un mensaje de prueba"
That results in the following scores:
[fr] => 0.56424275560416
[es] => 0.54543466375068
[pt-PT] => 0.52006560962274
[pt-BR] => 0.51919081465282

Spanish should be the first one. I know it's a very short text, but maybe you can improve something here.

Cheers

Negative language probability

I am trying to improve language detection and have set up a separate folder with samples, as mentioned

$t = new LanguageDetection\Trainer();
$t->setMaxNgrams(9000);
$t->learn('/project/language/samples');

So it created json files in language directories
But when I try to detect language:

$ld = new LanguageDetection\Language([], '/project/language/samples');
$ld->detect('some text here')->close();

I got negative probability

[
     "bg" => -0.63268817204301,
     "ru" => -1.183311827957,
 ]

So if bestResults() is used, the wrong language code is returned. The text in my case is Russian.

Is it normal that negative probability is returned?
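One way a score can drop below zero is a mismatch between the n-gram limit used for training and the one used for detection; the README's FAQ notes that the detector must also be told how many n-grams to use, and the snippet in this issue does not call setMaxNgrams() on the Language object. The sketch below illustrates the effect with a hypothetical rank-distance score. The formula, the function name `rankDistanceScore` and the sample ranks are assumptions made for this illustration and are not taken from the library's source.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical "out-of-place" score (an assumption of this sketch):
 * 1 - (sum of rank distances) / (maxNgrams * number of sample n-grams).
 * $sample is the ranked n-gram list of the input text,
 * $profileRanks maps n-gram => rank in the trained language profile.
 */
function rankDistanceScore(array $sample, array $profileRanks, int $maxNgrams): float
{
    if ($sample === []) {
        return 0.0;
    }

    $sum = 0;
    foreach ($sample as $rank => $gram) {
        // N-grams missing from the profile get the maximum penalty.
        $sum += isset($profileRanks[$gram]) ? abs($rank - $profileRanks[$gram]) : $maxNgrams;
    }

    return 1 - $sum / ($maxNgrams * count($sample));
}

// Made-up profile trained with up to 9000 n-grams: ranks can be far above 310.
$profile = ['_t' => 4000, 'te' => 6500, 'ex' => 8000];
$sample  = ['_t', 'te', 'ex'];

var_dump(rankDistanceScore($sample, $profile, 9000)); // between 0 and 1
var_dump(rankDistanceScore($sample, $profile, 310));  // negative
```

Under these assumptions, calling setMaxNgrams(9000) on the Language object as well, as shown in the FAQ, keeps the normalisation consistent with the trained profile.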

How can I add a new language?

Discussed in #59

Originally posted by marcovlesmes February 2, 2024
Hi, I am trying to incorporate a new language in the library, I have already added a .txt file in resources with the structure:
resources\nhe\nhe.txt
I trained the library and tried to detect the language of a phrase; however, the language does not appear in the list of languages. Does the new language need to be registered somewhere?

The detected languages seem wrong very often

Hi.

I am trying out this library and it seems that I am getting wrong language detections.
After training the library with 9.000 n-grams, I tested this code:

use LanguageDetection\Language;

$ld = new Language;
$ld->setMaxNgrams(9000);

var_export($ld->detect('Je souhaite annuler mon abonnement')->limit(0, 3)->close());
var_export($ld->detect('Merci beaucoup')->limit(0, 3)->close());

I was expecting both sentences to be detected with French at number 1, but I got this:

array (
  'fr' => 0.83144592592593,
  'de' => 0.81363111111111,
  'en' => 0.80018962962963,
);
array (
  'en' => 0.8102134502924,
  'fr' => 0.80741520467836,
  'ca' => 0.78646198830409,
);

Note how in the second call, French is number 2 and English is number 1.
What is causing this and how can the system be improved to get better results?

Thanks in advance

Incorrect language is being returned for specific words

Hi,

The words "System Scaling Strategy" will return anything but English it seems. I've had West Frisian, Afrikaans and now Swedish as the "best result".

This is the output of language detection:
( 'sv' => 0.4363359707851491, 'nl' => 0.43365794278758374, 'hu' => 0.4318320146074255, 'nb' => 0.42452830188679247, 'de' => 0.4225197808886184, 'da' => 0.4211199026171637, 'en' => 0.41880706025562997, 'la' => 0.4111990261716373, 'pt-BR' => 0.4084601339013999, 'pt-PT' => 0.40833840535605603, 'id' => 0.40626902008520993, 'is' => 0.3883749239196591, 'lt' => 0.38758368837492396, 'sl' => 0.38533171028606206, 'ro' => 0.3756542909312234, 'es' => 0.37321972002434567, 'it' => 0.3685940353012781, 'et' => 0.3680462568472307, 'cs' => 0.3612903225806452, 'cy' => 0.3496652465003043, 'pl' => 0.3467437614120511, 'bs-Latn' => 0.3446743761412051, 'lv' => 0.3444917833231893, 'fr' => 0.3375532562385879, 'tr' => 0.33609251369446136, 'hr' => 0.33049300060864273, 'sk' => 0.32860620815581254, 'fi' => 0.3032258064516129, 'gd' => 0.29519172245891656, 'ga' => 0.28387096774193543, 'ka' => 0, 'ml' => 0, 'hy' => 0, 'el-polyton' => 0, 'el-monoton' => 0, 'uk' => 0, 'zh-Hans' => 0, 'zh-Hant' => 0, )

I limited the languages to exclude Frisian and Afrikaans. Why is English not the first in the list?

What dataset?

Hello there, may I ask what database has been used to create this library?

Grille pain question 4

Hi,

Sorry, I forgot question 4:

'xzy' returns

array(6) {
  ["it"]=>
  float(0.092666666666667)
  ["en"]=>
  float(0.092)
  ["nl"]=>
  float(0.088)
  ["de"]=>
  float(0.084666666666667)
  ["es"]=>
  float(0.081666666666667)
  ["fr"]=>
  float(0.043333333333333)
}

I can not launch translation with that.

-Could you briefly explain what these figures are and at what level one can say that they are reliable enough to return a detected language? You could add a bestResult() method, 'winner by KO' (like Germany Brazil 7-0 ;).

-Maybe more interesting for developers would be a validate() method which returns false if the language definitely cannot be detected. I would insert it in my validation process.

Thanks again.
Michel
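A check along the lines requested here can be written in userland on top of the array returned by close(). The sketch below is hypothetical: the function name `confidentLanguage` and both threshold values are placeholders that would need tuning on real data, and nothing like it is confirmed to exist in the library. It accepts the top language only if its score is high enough and clearly ahead of the runner-up.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: returns the best language only if its score reaches
 * $minScore and leads the second-best score by at least $minMargin;
 * otherwise returns null ("cannot decide").
 */
function confidentLanguage(array $scores, float $minScore = 0.5, float $minMargin = 0.05): ?string
{
    if ($scores === []) {
        return null;
    }

    arsort($scores); // highest score first
    $languages = array_keys($scores);
    $values    = array_values($scores);

    $best   = $values[0];
    $second = $values[1] ?? 0.0;

    if ($best < $minScore || ($best - $second) < $minMargin) {
        return null;
    }

    return (string) $languages[0];
}

// Scores quoted elsewhere in this document.
var_dump(confidentLanguage(['nl' => 0.66193548387097, 'af' => 0.51338709677419])); // "nl"
var_dump(confidentLanguage(['it' => 0.54711111111111, 'fr' => 0.54633333333333])); // NULL (too close)
var_dump(confidentLanguage(['it' => 0.092666666666667, 'en' => 0.092]));           // NULL (too low)
```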

Create langLibrary from different directories

Hello,

Now, when we create a library, we can use following ways:

  1. new Language() - the following directory is used by default: __DIR__ . '/../../resources/*/*.json';

  2. We use other .json-files (take as an example, en language):
    2.1 Create at any place folder $dirname=LanguageDetection/en/.
    2.2 Put there your own text file: en.txt.
    2.3 Train library:

          $t = new Trainer();
          $t->learn($dirname);
    

    2.4 Then use newly created/updated .json-file: LanguageDetection/en/en.json by:

new Language([], $dirname)

So, if we want to use default lang file, we should:

  1. Copy already existing en.txt to our newly created folder.
  2. Add our text to existing
  3. Train library.
  4. Use newly created/updated en.json

Request:

To avoid copy-pasting, it would be good to be able to use the existing en.json together with the newly created one, something like:
new Language([], $dirname, $useDefaultFile = true):

  • 3rd params by default is false
  • if dirname is defined and $useDefaultFile=true: use both paths together, the default one (__DIR__ . '/../../resources/*/*.json') and the new one (dirname)

Trainer echo?

Hi,

I'm using the Trainer class like so:

$languageTrainer = new Trainer();
$languageTrainer->setMaxNgrams(9000);
$languageTrainer->learn();

I wasn't expecting it to echo anything - why is it not returning a success / failure?


Also, GlobIterator expects 2 parameters, 1 given.


Source of language datasets

Where is the source text dataset for the Ngrams of those 110 languages? Would like to see if it is different from wooorm/franc#78 usage of UDHR, and if it is more accurate than them.

__toString() must be of the type string, null returned

PHP 7.0 + Laravel 5.5

$language = $this->languageDetect->detect($rankedKeyword->query)->__toString();

[2017-09-22 15:40:33] local.ERROR: Type error: Return value of LanguageDetection\LanguageResult::__toString() must be of the type string, null returned {"exception":"[object] (Symfony\Component\Debug\Exception\FatalThrowableError(code: 0): Type error: Return value of LanguageDetection\LanguageResult::__toString() must be of the type string, null returned at /var/www/html/meek.com.cn/trendx-crawler/vendor/patrickschur/language-detection/src/LanguageDetection/LanguageResult.php:86)
[stacktrace]
#0 /var/www/html/meek.com.cn/trendx-crawler/app/GoogleTrendsCrawler.php(212): LanguageDetection\LanguageResult->__toString()
#1 /var/www/html/meek.com.cn/trendx-crawler/app/Console/Commands/Crawl.php(53): App\GoogleTrendsCrawler->fetchQueries(Array)
#2 [internal function]: App\Console\Commands\Crawl->handle()
#3 /var/www/html/meek.com.cn/trendx-crawler/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(29): call_user_func_array(Array, Array)
#4 /var/www/html/meek.com.cn/trendx-crawler/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(87): Illuminate\Container\BoundMethod::Illuminate\Container\{closure}()
#5 /var/www/html/meek.com.cn/trendx-crawler/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(31): Illuminate\Container\BoundMethod::callBoundMethod(Object(Illuminate\Foundation\Application), Array, Object(Closure))
#6 /var/www/html/meek.com.cn/trendx-crawler/vendor/laravel/framework/src/Illuminate/Container/Container.php(549): Illuminate\Container\BoundMethod::call(Object(Illuminate\Foundation\Application), Array, Array, NULL)
#7 /var/www/html/meek.com.cn/trendx-crawler/vendor/laravel/framework/src/Illuminate/Console/Command.php(180): Illuminate\Container\Container->call(Array)
#8 /var/www/html/meek.com.cn/trendx-crawler/vendor/symfony/console/Command/Command.php(264): Illuminate\Console\Command->execute(Object(Symfony\Component\Console\Input\ArgvInput), Object(Illuminate\Console\OutputStyle))
#9 /var/www/html/meek.com.cn/trendx-crawler/vendor/laravel/framework/src/Illuminate/Console/Command.php(167): Symfony\Component\Console\Command\Command->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Illuminate\Console\OutputStyle))
#10 /var/www/html/meek.com.cn/trendx-crawler/vendor/symfony/console/Application.php(888): Illuminate\Console\Command->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#11 /var/www/html/meek.com.cn/trendx-crawler/vendor/symfony/console/Application.php(224): Symfony\Component\Console\Application->doRunCommand(Object(App\Console\Commands\Crawl), Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#12 /var/www/html/meek.com.cn/trendx-crawler/vendor/symfony/console/Application.php(125): Symfony\Component\Console\Application->doRun(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#13 /var/www/html/meek.com.cn/trendx-crawler/vendor/laravel/framework/src/Illuminate/Console/Application.php(88): Symfony\Component\Console\Application->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#14 /var/www/html/meek.com.cn/trendx-crawler/vendor/laravel/framework/src/Illuminate/Foundation/Console/Kernel.php(121): Illuminate\Console\Application->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#15 /var/www/html/meek.com.cn/trendx-crawler/artisan(37): Illuminate\Foundation\Console\Kernel->handle(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#16 {main}
"}

Language detection with php 5.6

Hey,
I've been trying to do some basic operations (as described in readme file), but it simply doesn't work.
I've been using PHP 5.6 for an old project and I removed those operators, which are newer (I've error reporting to E_ALL) and it doesn't have any errors, but it returns an empty array when calling this:
$ld->detect('Mag het een onsje meer zijn?')->close();

Any ideas what I'm doing wrong or perhaps if there are any problems with older versions of PHP?

Feature Request - Min language's values

Currently we can use the limit function to return a specific quantity of languages:

$ld->detect('Mag het een onsje meer zijn?')->blacklist('af', 'dk', 'sv')->limit(0, 4)->close();

Array
(
    "nl" => 0.66193548387097
    "br" => 0.49634408602151
    "nb" => 0.48849462365591
    "nn" => 0.48741935483871
)

Would be nice to have a standalone function in the library to limit the results by its values.

$ld->detect('Mag het een onsje meer zijn?')->blacklist('af', 'dk', 'sv')->min(0.5)->close(); // or atLeast() instead of min()

Array
(
    "nl" => 0.66193548387097
)

// In case of a greater than number:

$ld->detect('Mag het een onsje meer zijn?')->blacklist('af', 'dk', 'sv')->min(1)->close(); // or atLeast() instead of min()

Array
(
)
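Until something like min() exists, the same effect can be had in userland by filtering the array that close() returns. A minimal sketch follows; the function name `atLeast` is borrowed from the suggestion above and is hypothetical, not a library method.

```php
<?php
declare(strict_types=1);

// Hypothetical userland equivalent of the requested min()/atLeast():
// keeps only the entries whose score is at least $min, preserving keys.
function atLeast(array $scores, float $min): array
{
    return array_filter($scores, static function ($score) use ($min): bool {
        return $score >= $min;
    });
}

// Scores from the example above.
$scores = [
    'nl' => 0.66193548387097,
    'br' => 0.49634408602151,
    'nb' => 0.48849462365591,
    'nn' => 0.48741935483871,
];

print_r(atLeast($scores, 0.5)); // only "nl" remains
print_r(atLeast($scores, 1.0)); // empty array
```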

need a new release

I installed this package with Composer, but I still get this error

PHP Fatal error:  Uncaught TypeError: array_merge() expects at least 1 parameter, 0 given in /home/.../vendor/patrickschur/language-detection/src/LanguageDetection/NgramParser.php:139
Stack trace:
#0 /home/.../vendor/patrickschur/language-detection/src/LanguageDetection/NgramParser.php(139): array_merge()
#1 /home/.../vendor/patrickschur/language-detection/src/LanguageDetection/Language.php(50): LanguageDetection\NgramParser->getNgrams('\xEF\xBC\x9F\xEF\xBC\x9F')
#2 /home/.../Fetchers/Mapping.php(140): LanguageDetection\Language->detect('\xEF\xBC\x9F\xEF\xBC\x9F')
#3 /home/.../test/index.php(31): Fetchers\Mapping::languageDetect('\xEF\xBC\x9F\xEF\xBC\x9F')
#4 {main}
  thrown in /home/.../vendor/patrickschur/language-detection/src/LanguageDetection/NgramParser.php on line 139

But this bug is fixed by commit 610e8a1.

When I run composer update patrickschur/language-detection, the code is still not updated.

How can the library detect the wrong language on such simple text?

I tried the library on the following sentence: "Please delete my account."
The identified languages that i got are listed below.
English came in 5th!

BTW, I tried this with quite a few English sentences, and the results were not very good...

Am i using the library wrong?
This is what I did:

<?php
$detector = new LanguageDetection\Language();
var_export($detector->detect("Please delete my account.")->close());
?>

array (
'fr' => 0.50662139219015,
'es' => 0.50311262026033,
'ca' => 0.49643463497453,
'ro' => 0.48630447085456,
'en' => 0.48381437464629,
'ia' => 0.47747594793435,
'gl' => 0.46955291454443,
'wa' => 0.46768534238823,
'nb' => 0.45551782682513,
'it' => 0.45297113752122,
'ch' => 0.44923599320883,
'pt-PT' => 0.44069043576684,
'da' => 0.44063384267119,
'pt-BR' => 0.43848330503679,
'fy' => 0.43746462931522,
'nn' => 0.43497453310696,
'af' => 0.4343520090549,
'br' => 0.43033389926429,
'sv' => 0.42959818902094,
'ga' => 0.42716468590832,
'eo' => 0.42648556876061,
'la' => 0.4258064516129,
'io' => 0.42354272778721,
'de' => 0.42195812110922,
'et' => 0.42167515563101,
'hu' => 0.41680814940577,
'gn' => 0.41533672891907,
'to' => 0.41041312959819,
'id' => 0.41001697792869,
'tr' => 0.40288624787776,
'cy' => 0.39711375212224,
'ss' => 0.39400113186191,
'ug-Latn' => 0.39383135257499,
'gd' => 0.3933220147142,
'nl' => 0.39286926994907,
'ms-Latn' => 0.38828522920204,
'ku' => 0.38760611205433,
'sq' => 0.38664402942841,
'eu' => 0.38075834748161,
'cs' => 0.37826825127334,
'fi' => 0.37549518958687,
'wo' => 0.37006225240521,
'bi' => 0.36949632144878,
'xh' => 0.35529145444256,
'ln' => 0.35466893039049,
'sl' => 0.35336728919072,
'jv' => 0.35217883418223,
'sk' => 0.34821731748727,
've' => 0.33938879456706,
'ng' => 0.32823995472552,
'ig' => 0.32591963780419,
'pl' => 0.32382569326542,
'mt' => 0.32252405206565,
'co' => 0.32116581777023,
'lv' => 0.3184493491794,
'lt' => 0.31839275608376,
'sr-Latn' => 0.31618562535371,
'lg' => 0.31601584606678,
'tl' => 0.31494057724958,
'so' => 0.31290322580645,
'bs-Latn' => 0.30820599886814,
'hr' => 0.29881154499151,
'fj' => 0.29626485568761,
'ay' => 0.29111488398415,
'vi' => 0.28726655348048,
'az-Latn' => 0.28053197509904,
'kr' => 0.27549518958687,
'is' => 0.27464629315224,
'fo' => 0.26779852857951,
'yo' => 0.26445953593662,
'ha' => 0.26225240520656,
'mh' => 0.26140350877193,
'ty' => 0.22948500282965,
'nv' => 0.21522354272779,
'zh-Hans' => 0,
'sr-Cyrl' => 0,
'uz' => 0,
'ur' => 0,
'uk' => 0,
'ug-Arab' => 0,
'ta' => 0,
'th' => 0,
'tt' => 0,
'ab' => 0,
'sa' => 0,
'el-polyton' => 0,
'am' => 0,
'ar' => 0,
'az-Cyrl' => 0,
'be' => 0,
'bg' => 0,
'bn' => 0,
'bo' => 0,
'bs-Cyrl' => 0,
'cr' => 0,
'dz' => 0,
'el-monoton' => 0,
'fa' => 0,
'ru' => 0,
'gu' => 0,
'he' => 0,
'hi' => 0,
'hy' => 0,
'iu' => 0,
'ja' => 0,
'ka' => 0,
'km' => 0,
'ko' => 0,
'lo' => 0,
'mn-Cyrl' => 0,
'ms-Arab' => 0,
'zh-Hant' => 0,
)

Make location of the language resources configurable

It would be nice if the location of the resources would be configurable. Currently the Trainer and Language classes are bound to the resources folder. It would make managing the trainer files a bit easier.

Thanks :)

Deprecation notice with PHP 8.1

In PHP 8.1 a lot of interfaces got updated with new return types. That includes ArrayAccess. Therefore LanguageResult produces some little notices on PHP 8.1:

Deprecated: Return type of LanguageDetection\LanguageResult::offsetExists($offset) should either be compatible with ArrayAccess::offsetExists(mixed $offset): bool, or the #[\ReturnTypeWillChange] attribute should be used to temporarily suppress the notice in /app/vendor/patrickschur/language-detection/src/LanguageDetection/LanguageResult.php on line 37
Deprecated: Return type of LanguageDetection\LanguageResult::offsetGet($offset) should either be compatible with ArrayAccess::offsetGet(mixed $offset): mixed, or the #[\ReturnTypeWillChange] attribute should be used to temporarily suppress the notice in /app/vendor/patrickschur/language-detection/src/LanguageDetection/LanguageResult.php on line 46
Deprecated: Return type of LanguageDetection\LanguageResult::offsetSet($offset, $value) should either be compatible with ArrayAccess::offsetSet(mixed $offset, mixed $value): void, or the #[\ReturnTypeWillChange] attribute should be used to temporarily suppress the notice in /app/vendor/patrickschur/language-detection/src/LanguageDetection/LanguageResult.php on line 56
Deprecated: Return type of LanguageDetection\LanguageResult::offsetUnset($offset) should either be compatible with ArrayAccess::offsetUnset(mixed $offset): void, or the #[\ReturnTypeWillChange] attribute should be used to temporarily suppress the notice in /app/vendor/patrickschur/language-detection/src/LanguageDetection/LanguageResult.php on line 68

version: v5.1.0
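For reference, the notices above go away once the four ArrayAccess methods declare return types compatible with the PHP 8.1 interface, or carry the #[\ReturnTypeWillChange] attribute. The sketch below uses a hypothetical stand-in class named `ScoreList`; it is not the library's LanguageResult and does not claim to show how the library resolved this.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in class showing ArrayAccess signatures that do not
// trigger the PHP 8.1 deprecation notices.
final class ScoreList implements ArrayAccess
{
    /** @var array<string, float> */
    private $scores;

    public function __construct(array $scores = [])
    {
        $this->scores = $scores;
    }

    public function offsetExists($offset): bool
    {
        return isset($this->scores[$offset]);
    }

    // A "mixed" return type needs PHP 8.0+; the attribute suppresses the
    // notice on 8.1 and is read as a plain comment by PHP 7.
    #[\ReturnTypeWillChange]
    public function offsetGet($offset)
    {
        return $this->scores[$offset] ?? null;
    }

    public function offsetSet($offset, $value): void
    {
        if ($offset === null) {
            $this->scores[] = $value;
        } else {
            $this->scores[$offset] = $value;
        }
    }

    public function offsetUnset($offset): void
    {
        unset($this->scores[$offset]);
    }
}

$list = new ScoreList(['de' => 0.66]);
var_dump(isset($list['de']), $list['xy']); // bool(true), NULL
```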

Issue with detection of English text

Phrase : A beautiful villa in eastern Sweden --- Language Detected : sv

Phrase : Uma bela casa no leste da Suécia --- Language Detected : pt-BR
Phrase : Eine schöne Villa in Ost-Schweden --- Language Detected : de
Phrase : En vacker villa i östra sverige --- Language Detected : sv
Phrase : পূর্ব সুইডেনে একটি সুন্দর বাগানবাড়ি --- Language Detected : bn
Phrase : Une belle villa en Suède orientale --- Language Detected : it
Phrase : فيلا جميلة في شرق السويد --- Language Detected : ar
Phrase : En vakker villa i Øst-Sverige --- Language Detected : sv
Phrase : Una hermosa villa en el este de Suecia --- Language Detected : es
Phrase : En smuk villa i det østlige Sverige --- Language Detected : dk
Phrase : Kaunis huvila Itä Ruotsissa --- Language Detected : fi
Phrase : पूर्वी स्वीडन में एक खूबसूरत विला --- Language Detected : hi
Phrase : Una bella villa in Svezia orientale --- Language Detected : it
Phrase : 東部スウェーデンの美しいヴィラ --- Language Detected : ja
Phrase : O vilă frumoasă în estul Suediei --- Language Detected : ro
Phrase : Isang magandang villa sa silangang Sweden --- Language Detected : jv
Phrase : Красивая вилла в восточной части Швеции --- Language Detected : ru
Phrase : Красива вілла в східній частині Швеції --- Language Detected : uk

Above are the phrases and the languages detected. The library seems to be working fine, but it's not detecting the correct language for simple English text.

What's the right way of checking whether or not the text is in a specific language?

Hi! I'd like to use this library to check whether or not a (plain) text is in English.

What I'd like to do is something like:

if ((new \LanguageDetection\Language(['en']))->detect($str)->close()['en'] < THRESHOLD) {
    echo 'It surely is not in English';
}

It seems to me that the returned score isn't a probability, though, because it changes based on the maxNgrams used (and sometimes is negative). This makes me think that the "threshold" solution above may not be valid.

I might be able to find the answer on my own by digging a bit more into the code but I thought that an answer from the contributors could benefit other users too. Plus, it would be good to add an explanation about what the scores are in the README.
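
One possible direction, offered only as a sketch and not confirmed by the maintainers: since the score does not behave like a probability, it may be more robust to compare English against a few other candidate languages and check that it ranks first with a clear lead, rather than testing a single score against a fixed threshold. The helper below works on a plain score array of the shape returned by close(). The function name isLikelyLanguage and the $minLead parameter are hypothetical, not part of the library, and the default lead of 0.05 is an arbitrary assumption that would need tuning on real data.

```php
<?php
// Hypothetical helper, not part of the library. Given an array of
// language code => score (the shape returned by close()), report whether
// $wanted ranks first and leads the runner-up by at least $minLead.
function isLikelyLanguage(array $scores, string $wanted, float $minLead = 0.05): bool
{
    // With no competing language there is nothing to compare against.
    if (!isset($scores[$wanted]) || count($scores) < 2) {
        return false;
    }

    arsort($scores);                 // highest score first, keys preserved
    $codes = array_keys($scores);

    if ($codes[0] !== $wanted) {
        return false;                // another language scored higher
    }

    // Lead of the best score over the second-best score.
    return ($scores[$codes[0]] - $scores[$codes[1]]) >= $minLead;
}

// Made-up scores for illustration only, not real library output.
$scores = ['en' => 0.61, 'sv' => 0.48, 'de' => 0.45];
var_dump(isLikelyLanguage($scores, 'en'));
```

With the library, the score array would presumably come from a detector restricted to a handful of candidates, for example (new Language(['en', 'de', 'fr']))->detect($str)->close(). Restricting it to ['en'] alone leaves nothing to compare against, so the helper returns false in that case.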
