patrickschur / language-detection

A language detection library for PHP. Detects the language from a given text string.

License: MIT License


language-detection

This library detects the language of a given text string. In the training phase it parses training text in many different languages into sequences of N-grams and builds a database file in PHP. In the detection phase it uses that database to determine the language of a given text. The library ships with text samples for training and detecting text in 110 languages.
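
To make the idea of N-grams concrete, here is a minimal sketch in plain PHP that does not use the library. The function name `extractNgrams`, the underscore padding and the n-gram lengths 1 to 3 are assumptions for illustration; the library's own parser may differ in detail. It needs the Multibyte String extension.

```php
<?php
// Hypothetical helper, not part of the library: counts character n-grams
// of length 1..$max for each word, padding every word with "_" on both sides.
function extractNgrams(string $text, int $max = 3): array
{
    $counts = [];
    // Split on anything that is not a letter (Unicode-aware).
    $words = preg_split('/[^\pL]+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    foreach ($words as $word) {
        $padded = '_' . $word . '_';
        $len = mb_strlen($padded);
        for ($n = 1; $n <= $max; $n++) {
            for ($i = 0; $i + $n <= $len; $i++) {
                $gram = mb_substr($padded, $i, $n);
                $counts[$gram] = ($counts[$gram] ?? 0) + 1;
            }
        }
    }

    arsort($counts); // most frequent n-grams first
    return $counts;
}

$ngrams = extractNgrams('het een');
// "e" occurs three times, the trigram "het" once.
```

A language profile in this sense is simply the top of such a frequency list, built from a large training text.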

Installation with Composer
How to upgrade from 3.y.z to 4.y.z?
Basic Usage
API
Method Chaining
Array Access
List of supported languages
Other languages
FAQ
Contributing
License

Installation with Composer

Note: This library requires the Multibyte String extension in order to work.

$ composer require patrickschur/language-detection

How to upgrade from `3.y.z` to `4.y.z`?

Important: This only applies if you use a custom directory with your own translation files.

Starting with version 4.y.z, the resource files have changed: for performance reasons they are now stored as PHP instead of JSON. If you used 3.y.z before and want to use 4.y.z, you have to upgrade your JSON files to PHP. To do so, generate the language profiles again. The JSON files are no longer needed afterwards.

You can delete unnecessary JSON files under Linux with the following command.

rm resources/*/*.json

Basic Usage

To detect the language correctly, the input text should be at least a few sentences long.

use LanguageDetection\Language;
 
$ld = new Language;
 
$ld->detect('Mag het een onsje meer zijn?')->close();

Result:

Array
(
    "nl" => 0.66193548387097,
    "af" => 0.51338709677419,
    "br" => 0.49634408602151,
    "nb" => 0.48849462365591,
    "nn" => 0.48741935483871,
    "fy" => 0.47822580645161,
    "dk" => 0.47172043010753,
    "sv" => 0.46408602150538,
    "bi" => 0.46021505376344,
    "de" => 0.45903225806452,
    [...]
)

API

`__construct(array $result = [], string $dirname = '')`

You can pass an array of languages to the constructor so that the given sentence is compared only with those languages. This can dramatically increase performance. The second parameter is optional and is the name of the directory where the translation files are located.

$ld = new Language(['de', 'en', 'nl']);
 
// Compares the sentence only with "de", "en" and "nl" language models.
$ld->detect('Das ist ein Test');

`whitelist(string ...$whitelist)`

Provide a whitelist. Returns only the languages that are listed.

$ld->detect('Mag het een onsje meer zijn?')->whitelist('de', 'nn', 'nl', 'af')->close();

Result:

Array
(
    "nl" => 0.66193548387097,
    "af" => 0.51338709677419,
    "nn" => 0.48741935483871,
    "de" => 0.45903225806452
)

`blacklist(string ...$blacklist)`

Provide a blacklist. Removes the given languages from the result.

$ld->detect('Mag het een onsje meer zijn?')->blacklist('dk', 'nb', 'de')->close();

Result:

Array
(
    "nl" => 0.66193548387097,
    "af" => 0.51338709677419,
    "br" => 0.49634408602151,
    "nn" => 0.48741935483871,
    "fy" => 0.47822580645161,
    "sv" => 0.46408602150538,
    "bi" => 0.46021505376344,
    [...]
)

`bestResults()`

Returns the best results.

$ld->detect('Mag het een onsje meer zijn?')->bestResults()->close();

Result:

Array
(
    "nl" => 0.66193548387097
)

`limit(int $offset, int $length = null)`

You can specify the number of records to return. For example, the following code will return the top three entries.

$ld->detect('Mag het een onsje meer zijn?')->limit(0, 3)->close();

Result:

Array
(
    "nl" => 0.66193548387097,
    "af" => 0.51338709677419,
    "br" => 0.49634408602151
)

`close()`

Returns the result as an array.

$ld->detect('This is an example!')->close();

Result:

Array
(
    "en" => 0.5889400921659,
    "gd" => 0.55691244239631,
    "ga" => 0.55376344086022,
    "et" => 0.48294930875576,
    "af" => 0.48218125960061,
    [...]
)

`setTokenizer(TokenizerInterface $tokenizer)`

The script uses a tokenizer to get all words in a sentence. You can define your own tokenizer, for example to deal with numbers.

$ld->setTokenizer(new class implements TokenizerInterface
{
    public function tokenize(string $str): array 
    {
        return preg_split('/[^a-z0-9]/u', $str, -1, PREG_SPLIT_NO_EMPTY);
    }
});

This tokenizer returns only tokens made up of lowercase letters (a-z) and digits (0-9).
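
The effect of that pattern can be checked without the library. The sketch below applies the same `preg_split` call to plain strings; `tokenizeAlnum` is a hypothetical name. Note that the pattern treats uppercase letters as separators, so the sketch assumes the text is lowercased before it is tokenized.

```php
<?php
// Hypothetical standalone version of the tokenizer body shown above.
function tokenizeAlnum(string $str): array
{
    return preg_split('/[^a-z0-9]/u', $str, -1, PREG_SPLIT_NO_EMPTY);
}

$plain   = tokenizeAlnum('room 42 is free!');       // ['room', '42', 'is', 'free']
$upper   = tokenizeAlnum('Hello');                  // ['ello'], "H" acts as a separator
$lowered = tokenizeAlnum(mb_strtolower('Hello'));   // ['hello']
```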

`__toString()`

Returns the top entry of the result. Note the echo at the beginning.

echo $ld->detect('Das ist ein Test.');

Result:

de

`jsonSerialize()`

Serializes the data to JSON.

$object = $ld->detect('Tere tulemast tagasi! Nägemist!');
 
json_encode($object, JSON_PRETTY_PRINT);

Result:

{
    "et": 0.5224748810153358,
    "ch": 0.45817028027498674,
    "bi": 0.4452670544685352,
    "fi": 0.440983606557377,
    "lt": 0.4382866208355367,
    [...]
}

Method chaining

You can also combine methods with each other. The following example removes all entries specified in the blacklist and returns only the top four entries.

$ld->detect('Mag het een onsje meer zijn?')->blacklist('af', 'dk', 'sv')->limit(0, 4)->close();

Result:

Array
(
    "nl" => 0.66193548387097,
    "br" => 0.49634408602151,
    "nb" => 0.48849462365591,
    "nn" => 0.48741935483871
)
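
What the chain does to the scores can be reproduced on a plain array. This is only a sketch of the observable behaviour using built-in array functions, not the library's implementation; the scores are copied from the Basic Usage result.

```php
<?php
$scores = [
    'nl' => 0.66193548387097,
    'af' => 0.51338709677419,
    'br' => 0.49634408602151,
    'nb' => 0.48849462365591,
    'nn' => 0.48741935483871,
    'fy' => 0.47822580645161,
    'dk' => 0.47172043010753,
    'sv' => 0.46408602150538,
];

// blacklist('af', 'dk', 'sv'): drop the listed keys, keep the order.
$filtered = array_diff_key($scores, array_flip(['af', 'dk', 'sv']));

// limit(0, 4): keep four entries starting at offset 0.
$top = array_slice($filtered, 0, 4, true);

$codes = array_keys($top); // ['nl', 'br', 'nb', 'nn']
```

The order of the calls matters: limiting first and blacklisting afterwards would leave fewer than four entries.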

ArrayAccess

You can also access the object directly as an array.

$object = $ld->detect('Das ist ein Test');
 
echo $object['de'];
echo $object['en'];
echo $object['xy']; // does not exist

Result:

0.6623339658444
0.56859582542694
NULL

Supported languages

The library currently supports 110 languages. To get an overview of all supported languages, please have a look here.

Other languages

The library is trainable, which means you can change, remove and add your own language files. If your language is not supported, feel free to add your own language files. To do that, create a new directory in resources and add your training text to it.

Note: The training text should be a .txt file.

Example

|- resources
    |- ham
        |- ham.txt
    |- spam
        |- spam.txt

As you can see, we can also use it to detect spam or ham.

If you store your translation files outside of resources, you have to specify the path.

$t->learn('YOUR_PATH_HERE');

Whenever you change one of the translation files you must first generate a language profile for it. This may take a few seconds.

use LanguageDetection\Trainer;
 
$t = new Trainer();
 
$t->learn();

Remove these few lines after execution. Now we can classify texts by their language with our own training text.

FAQ

How can I improve the detection phase?

To improve the detection phase you have to use more n-grams. But be careful: this will slow down the script. I found that detection is much better when using around 9,000 n-grams (the default is 310). To do that, look at the code right below:

$t = new Trainer();
 
$t->setMaxNgrams(9000);
 
$t->learn();

First you have to train it. Now you can classify texts like before but you must specify how many n-grams you want to use.

$ld = new Language();
 
$ld->setMaxNgrams(9000);
  
// "grille pain" is French and means "toaster" in English
var_dump($ld->detect('grille pain')->bestResults());

Result:

class LanguageDetection\LanguageResult#5 (1) {
  private $result =>
  array(2) {
    'fr' =>
    double(0.91307037037037)
    'en' =>
    double(0.90623333333333)
  }
}

Is the detection process slower if language files are very big?

No, it is not. The trainer class only uses the best 310 n-grams of each language. If you don't change this number or add more language files, performance is not affected. Only creating the N-grams is slower, and that has to be done only once. The detection phase is only affected when you try to detect big chunks of text.

Summary: The training phase will be slower but the detection phase remains the same.

Contributing

Feel free to contribute. Any help is welcome.

License

This project is licensed under the terms of the MIT license.

language-detection's Issues

Japanese is not detected correctly

$ld = new Language(['ja']);
$ld->setMaxNgrams(9000);

var_dump($ld->detect('タイトティーンアクション'));

All results are null, but the language is Japanese.

Get language name by language code

I suggest an improvement. Return not only the language code, but also its name.

https://github.com/patrickschur/language-detection/blob/master/resources/README.md

Compatible for PHP 8

Hello,

Can you please make it compatible for PHP 8 as well.

Thank you.
Sorn

TypeError: Return value of LanguageDetection\LanguageResult::__toString() must be of the type string, null returned

Error caused in https://github.com/patrickschur/language-detection/blob/master/src/LanguageDetection/LanguageResult.php#L86

    /**
     * @return string
     */
    public function __toString(): string
    {
        return key($this->result);
    }

The function key() returns null for an empty array.

An example case is when the input is a number, e.g. echo (string) $ld->detect('1992');

I will send a pull request if you need one.
Thank you.
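
One possible way to avoid the error, sketched outside the library: cast the key to a string so that an empty result yields an empty string instead of null. `topLanguage` is a hypothetical helper name, and whether an empty string is the right fallback is a decision for the maintainer.

```php
<?php
// Hypothetical helper mirroring the body of __toString() with a cast added.
function topLanguage(array $result): string
{
    // key() returns null for an empty array; the cast turns that into ''.
    return (string) key($result);
}

$best  = topLanguage(['de' => 0.66, 'en' => 0.57]); // "de"
$empty = topLanguage([]);                           // ""
```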

English text recognition

Hello,

This text is in English but I get this as a result.

xray no change + moderate non irradating pain nxray no change + dull pain nxray no change + biting

{ "ug-Latn": 0.422791679228218253427939998800866305828094482421875, "ch": 0.42007838408200182112040010906639508903026580810546875, "en": 0.412933373530298464260113178170286118984222412109375, "tl": 0.40669279469400054782823872301378287374973297119140625 }

Script states that the text is in Chinese before English.
What is the reason of this?

Unable to detect Chinese if there is only 1 character

$languageDetection = new Language();

$languageDetection->detect('很')->close();

The actual result would be 0 for all languages, while expecting zh-Hant and zh-Hans to have non-zero results.

Testing

sorry mistake.

Grille pain

Guten Tag,

I continue in english ;)

I compared all current language detectors based on sequence of N-grams and your solution is the best implementation.

However, I have to translate very short sequences of words. For example, the words 'Grille pain', which means toaster in English, return

array(6) {
  ["it"]=>
  float(0.54711111111111)
  ["fr"]=>
  float(0.54633333333333)
  ["en"]=>
  float(0.506)
  ["de"]=>
  float(0.49488888888889)
  ["nl"]=>
  float(0.49)
  ["es"]=>
  float(0.43466666666667)
}

It's almost good! it beats fr by a difference of 0.000777.

So I added the words 'Grille pain' to the fr file and ran the trainer. The new result is:

array(6) {
  ["fr"]=>
  float(0.54944444444444)
  ["it"]=>
  float(0.54711111111111)
  ["en"]=>
  float(0.506)
  ["de"]=>
  float(0.49488888888889)
  ["nl"]=>
  float(0.49)
  ["es"]=>
  float(0.43466666666667)
}

It's good. fr beats it by a difference of 0.002333.

So my questions:

1- Can I populate the language files so that I can be sure the first result wins by a very significant difference?

2- If yes to the previous question, is the detection process slower if the language files are very big?

3- I can see that in the fr file you have put the French declaration of rights. I don't think that a 200-year-old text represents the current French language very well. Is there some data somewhere that may be more accurate?

Thanks for your great job.

Best Regards
Michel

Possible improvement with spanish vs french

Hi,

I started using this library and it works great most of the time, but I came across a problem with the following sentence in Spanish: "Este es un mensaje de prueba"
That results in the following scores:
[fr] => 0.56424275560416
[es] => 0.54543466375068
[pt-PT] => 0.52006560962274
[pt-BR] => 0.51919081465282

Spanish should be the first one. I know it's a very short text, but maybe you can improve something here.

Cheers

__toString() must be of the type string, null returned

When there's no results, __toString() returns null which conflicts with the return type.
Happens when the input string is empty (maybe some other cases)

Negative language probability

I tried to improve language detection and set up a separate folder with samples, as mentioned:

$t = new LanguageDetection\Trainer();
$t->setMaxNgrams(9000);
$t->learn('/project/language/samples');

This created JSON files in the language directories.
But when I try to detect the language:

$ld = new LanguageDetection\Language([], '/project/language/samples');
$ld->detect('some text here')->close();

I get negative probabilities:

[
     "bg" => -0.63268817204301,
     "ru" => -1.183311827957,
 ]

So if bestResults() is used, the wrong language code is returned. The text in my case is Russian.

Is it normal that a negative probability is returned?

How can I add a new language?

Discussed in #59

Originally posted by marcovlesmes February 2, 2024
Hi, I am trying to incorporate a new language in the library, I have already added a .txt file in resources with the structure:
resources\nhe\nhe.txt
I trained the library and tried to detect the language of a phrase; however, the language does not appear in the list of languages. Does the new language need to be registered somewhere?

Is there any way to get the full name of the language along with the language code?

I want to get the language's full name with the code like when I get 'en'. I should be able to get 'English' as well.
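
Until the library offers this, a small lookup table on the caller's side is one option. The map below is a hypothetical, hand-written excerpt; `languageName` is not a library function, and unknown codes fall back to the code itself.

```php
<?php
// Hypothetical excerpt; extend it with the codes you actually use.
const LANGUAGE_NAMES = [
    'de' => 'German',
    'en' => 'English',
    'fr' => 'French',
    'nl' => 'Dutch',
];

function languageName(string $code): string
{
    return LANGUAGE_NAMES[$code] ?? $code;
}

$known   = languageName('en'); // "English"
$unknown = languageName('xy'); // "xy"
```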

The detected languages seem wrong very often

Hi.

I am trying out this library and it seems that I am getting wrong language detections.
After training the library with 9,000 n-grams, I tested this code:

use LanguageDetection\Language;

$ld = new Language;
$ld->setMaxNgrams(9000);

var_export($ld->detect('Je souhaite annuler mon abonnement')->limit(0, 3)->close());
var_export($ld->detect('Merci beaucoup')->limit(0, 3)->close());

I was expecting both sentences to be detected with French at number 1, but I got this:

array (
  'fr' => 0.83144592592593,
  'de' => 0.81363111111111,
  'en' => 0.80018962962963,
);
array (
  'en' => 0.8102134502924,
  'fr' => 0.80741520467836,
  'ca' => 0.78646198830409,
);

Note how in the second call, French is number 2 and English is number 1.
What is causing this and how can the system be improved to get better results?

Thanks in advance

Support for Kazakh language

Incorrect language is being returned for specific words

Hi,

The words "System Scaling Strategy" will return anything but English it seems. I've had West Frisian, Afrikaans and now Swedish as the "best result".

This is the output of language detection:
( 'sv' => 0.4363359707851491, 'nl' => 0.43365794278758374, 'hu' => 0.4318320146074255, 'nb' => 0.42452830188679247, 'de' => 0.4225197808886184, 'da' => 0.4211199026171637, 'en' => 0.41880706025562997, 'la' => 0.4111990261716373, 'pt-BR' => 0.4084601339013999, 'pt-PT' => 0.40833840535605603, 'id' => 0.40626902008520993, 'is' => 0.3883749239196591, 'lt' => 0.38758368837492396, 'sl' => 0.38533171028606206, 'ro' => 0.3756542909312234, 'es' => 0.37321972002434567, 'it' => 0.3685940353012781, 'et' => 0.3680462568472307, 'cs' => 0.3612903225806452, 'cy' => 0.3496652465003043, 'pl' => 0.3467437614120511, 'bs-Latn' => 0.3446743761412051, 'lv' => 0.3444917833231893, 'fr' => 0.3375532562385879, 'tr' => 0.33609251369446136, 'hr' => 0.33049300060864273, 'sk' => 0.32860620815581254, 'fi' => 0.3032258064516129, 'gd' => 0.29519172245891656, 'ga' => 0.28387096774193543, 'ka' => 0, 'ml' => 0, 'hy' => 0, 'el-polyton' => 0, 'el-monoton' => 0, 'uk' => 0, 'zh-Hans' => 0, 'zh-Hant' => 0, )

I limited the languages to exclude Frisian and Afrikaans. Why is English not the first in the list?

What dataset?

Hello there, may I ask what database has been used to create this library?

Grille pain question 4

Hi,

Sorry, I forgot question 4:

'xzy' returns

array(6) {
  ["it"]=>
  float(0.092666666666667)
  ["en"]=>
  float(0.092)
  ["nl"]=>
  float(0.088)
  ["de"]=>
  float(0.084666666666667)
  ["es"]=>
  float(0.081666666666667)
  ["fr"]=>
  float(0.043333333333333)
}

I can not launch translation with that.

- Could you briefly explain what these figures are and at which level one can say that they are reliable enough to return a detected language? You could add a bestResult() method 'winner by KO' (like Germany-Brazil 7-0 ;).

- Maybe more interesting for developers would be to add a validate() method which returns false if we definitely cannot detect the language. I will insert it in my validation process.

Thanks again.
Michel
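
The 'winner by KO' idea can be approximated on the caller's side by requiring a minimum score and a minimum lead over the runner-up. The sketch works on a plain score array such as the one returned by close(); the function name `isClearWinner` and both thresholds are arbitrary assumptions, not library values.

```php
<?php
// Hypothetical helper: true only if the best score is high enough and
// clearly ahead of the second-best score.
function isClearWinner(array $scores, float $minScore = 0.3, float $minLead = 0.05): bool
{
    if ($scores === []) {
        return false;
    }
    arsort($scores);                  // highest score first
    $values = array_values($scores);
    $best = $values[0];
    $second = $values[1] ?? 0.0;

    return $best >= $minScore && ($best - $second) >= $minLead;
}

// Rounded scores for 'xzy' from the report: too low and too close together.
$weak = isClearWinner(['it' => 0.0927, 'en' => 0.092, 'nl' => 0.088]); // false
$clear = isClearWinner(['nl' => 0.66, 'af' => 0.51]);                  // true
```

Suitable thresholds depend on the training data and the number of n-grams, so they would need tuning against real inputs.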

Can you recommend any article data to train better? The default data is too small

dk is not a valid language code

Please see: http://www.loc.gov/standards/iso639-2/php/code_list.php
I think you confused it with the country code:

da is the language code for Danish
da-DK is the regional Danish language spoken in Denmark

Create langLibrary from different directories

Hello,

Now, when we create a library, we can use the following ways:

1. new Language() - the following directory will be used (by default): __DIR__ . '/../../resources/*/*.json';
2. We use other .json files (take the en language as an example):
2.1 Create at any place folder $dirname=LanguageDetection/en/.
2.2 Put there your own text file: en.txt.
2.3 Train library:
```
      $t = new Trainer();
      $t->learn($dirname);
```
2.4 Then use newly created/updated .json-file: LanguageDetection/en/en.json by:

new Language([], $dirname)

So, if we want to use the default lang file, we should:

1. Copy the already existing en.txt to our newly created folder.
2. Add our text to the existing one.
3. Train the library.
4. Use the newly created/updated en.json.

Request:

To avoid the copy-paste, it would be good to have the possibility to use the already existing en.json and the newly created one together, something like:
new Language([], $dirname, $useDefaultFile = true):

The 3rd param is false by default.
If dirname is defined and $useDefaultFile = true: use the 2 paths together, the default one (__DIR__ . '/../../resources/*/*.json') and the new one (dirname).

Trainer echo?

Hi,

I'm using the Trainer class like so:

$languageTrainer = new Trainer();
$languageTrainer->setMaxNgrams(9000);
$languageTrainer->learn();

I wasn't expecting it to echo anything - why is it not returning a success / failure?

Also, GlobIterator expects 2 parameters, 1 given.

Source of language datasets

Where is the source text dataset for the Ngrams of those 110 languages? Would like to see if it is different from wooorm/franc#78 usage of UDHR, and if it is more accurate than them.

What about a third party reverse index engine?

Features of this package look really good for implementing a reverse index engine (On top of Redis+PHP or PHP itself).

Great work btw 👍

__toString() must be of the type string, null returned

PHP 7.0 + Laravel 5.5

$language = $this->languageDetect->detect($rankedKeyword->query)->__toString();

[2017-09-22 15:40:33] local.ERROR: Type error: Return value of LanguageDetection\LanguageResult::__toString() must be of the type string, null returned {"exception":"[object] (Symfony\Component\Debug\Exception\FatalThrowableError(code: 0): Type error: Return value of LanguageDetection\LanguageResult::__toString() must be of the type string, null returned at /var/www/html/meek.com.cn/trendx-crawler/vendor/patrickschur/language-detection/src/LanguageDetection/LanguageResult.php:86)
[stacktrace]
#0 /var/www/html/meek.com.cn/trendx-crawler/app/GoogleTrendsCrawler.php(212): LanguageDetection\LanguageResult->__toString()
#1 /var/www/html/meek.com.cn/trendx-crawler/app/Console/Commands/Crawl.php(53): App\GoogleTrendsCrawler->fetchQueries(Array)
#2 [internal function]: App\Console\Commands\Crawl->handle()
#3 /var/www/html/meek.com.cn/trendx-crawler/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(29): call_user_func_array(Array, Array)
#4 /var/www/html/meek.com.cn/trendx-crawler/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(87): Illuminate\Container\BoundMethod::Illuminate\Container\{closure}()
#5 /var/www/html/meek.com.cn/trendx-crawler/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(31): Illuminate\Container\BoundMethod::callBoundMethod(Object(Illuminate\Foundation\Application), Array, Object(Closure))
#6 /var/www/html/meek.com.cn/trendx-crawler/vendor/laravel/framework/src/Illuminate/Container/Container.php(549): Illuminate\Container\BoundMethod::call(Object(Illuminate\Foundation\Application), Array, Array, NULL)
#7 /var/www/html/meek.com.cn/trendx-crawler/vendor/laravel/framework/src/Illuminate/Console/Command.php(180): Illuminate\Container\Container->call(Array)
#8 /var/www/html/meek.com.cn/trendx-crawler/vendor/symfony/console/Command/Command.php(264): Illuminate\Console\Command->execute(Object(Symfony\Component\Console\Input\ArgvInput), Object(Illuminate\Console\OutputStyle))
#9 /var/www/html/meek.com.cn/trendx-crawler/vendor/laravel/framework/src/Illuminate/Console/Command.php(167): Symfony\Component\Console\Command\Command->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Illuminate\Console\OutputStyle))
#10 /var/www/html/meek.com.cn/trendx-crawler/vendor/symfony/console/Application.php(888): Illuminate\Console\Command->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#11 /var/www/html/meek.com.cn/trendx-crawler/vendor/symfony/console/Application.php(224): Symfony\Component\Console\Application->doRunCommand(Object(App\Console\Commands\Crawl), Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#12 /var/www/html/meek.com.cn/trendx-crawler/vendor/symfony/console/Application.php(125): Symfony\Component\Console\Application->doRun(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#13 /var/www/html/meek.com.cn/trendx-crawler/vendor/laravel/framework/src/Illuminate/Console/Application.php(88): Symfony\Component\Console\Application->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#14 /var/www/html/meek.com.cn/trendx-crawler/vendor/laravel/framework/src/Illuminate/Foundation/Console/Kernel.php(121): Illuminate\Console\Application->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#15 /var/www/html/meek.com.cn/trendx-crawler/artisan(37): Illuminate\Foundation\Console\Kernel->handle(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#16 {main}
"}

Language detection with php 5.6

Hey,
I've been trying to do some basic operations (as described in readme file), but it simply doesn't work.
I've been using PHP 5.6 for an old project and I removed those operators, which are newer (I've error reporting to E_ALL) and it doesn't have any errors, but it returns an empty array when calling this:
$ld->detect('Mag het een onsje meer zijn?')->close();

Any ideas what I'm doing wrong or perhaps if there are any problems with older versions of PHP?

Feature Request - Min language's values

Currently we can use the limit function to return a specific quantity of languages:

$ld->detect('Mag het een onsje meer zijn?')->blacklist('af', 'dk', 'sv')->limit(0, 4)->close();

Array
(
    "nl" => 0.66193548387097
    "br" => 0.49634408602151
    "nb" => 0.48849462365591
    "nn" => 0.48741935483871
)

Would be nice to have a standalone function in the library to limit the results by its values.

$ld->detect('Mag het een onsje meer zijn?')->blacklist('af', 'dk', 'sv')->min(0.5)->close(); // or atLeast() instead of min()

Array
(
    "nl" => 0.66193548387097
)

// In case of a greater than number:

$ld->detect('Mag het een onsje meer zijn?')->blacklist('af', 'dk', 'sv')->min(1)->close(); // or atLeast() instead of min()

Array
(
)
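
Until such a method exists, the same effect can be had by filtering the array returned by close(). A minimal sketch; `atLeast` is a hypothetical free function, not part of the library.

```php
<?php
// Hypothetical helper: keep only entries whose score is >= $min.
function atLeast(array $scores, float $min): array
{
    return array_filter($scores, function ($score) use ($min) {
        return $score >= $min;
    });
}

$scores = [
    'nl' => 0.66193548387097,
    'br' => 0.49634408602151,
    'nb' => 0.48849462365591,
];

$half = atLeast($scores, 0.5); // only "nl" remains
$none = atLeast($scores, 1);   // empty array
```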

the word "LOL" is not an english word ?

dude how the lol word is not an english word ?

Fatal error when trying to detect emoji only string

Failing at : https://github.com/patrickschur/language-detection/blob/master/src/LanguageDetection/NgramParser.php#L137

src/LanguageDetection/NgramParser.php:137 Type error: array_merge() expects at least 1 parameter, 0 given

When calling

LanguageDetection\Language->detect('\xF0\x9F\xA4\x97')

Would be great if the method detect() could return null or false if it can't detect a language, instead of a fatal error!

Thank you for your awesome lib !
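
On the caller's side, one defensive option is to check that the input contains at least one letter before calling detect(). This is a sketch under the assumption that letter-free input (emoji, digits, punctuation) is what triggers the failure; `hasLetters` is a hypothetical helper.

```php
<?php
// Hypothetical guard: true if the string contains at least one Unicode letter.
function hasLetters(string $text): bool
{
    return preg_match('/\p{L}/u', $text) === 1;
}

$emoji  = hasLetters("\u{1F917}");        // false, emoji only
$digits = hasLetters('1992');             // false, digits only
$text   = hasLetters('Das ist ein Test'); // true
```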

need a new release

I installed this package with composer, but I still get this error:

PHP Fatal error:  Uncaught TypeError: array_merge() expects at least 1 parameter, 0 given in /home/.../vendor/patrickschur/language-detection/src/LanguageDetection/NgramParser.php:139
Stack trace:
#0 /home/.../vendor/patrickschur/language-detection/src/LanguageDetection/NgramParser.php(139): array_merge()
#1 /home/.../vendor/patrickschur/language-detection/src/LanguageDetection/Language.php(50): LanguageDetection\NgramParser->getNgrams('\xEF\xBC\x9F\xEF\xBC\x9F')
#2 /home/.../Fetchers/Mapping.php(140): LanguageDetection\Language->detect('\xEF\xBC\x9F\xEF\xBC\x9F')
#3 /home/.../test/index.php(31): Fetchers\Mapping::languageDetect('\xEF\xBC\x9F\xEF\xBC\x9F')
#4 {main}
  thrown in /home/.../vendor/patrickschur/language-detection/src/LanguageDetection/NgramParser.php on line 139

But this bug is fixed by commit 610e8a1.

When I run composer update patrickschur/language-detection, the code is still not updated.

where is project amdvbflash?

I can't find the project amdvbflash,would you update it?

Detection of english string does not work correctly

I have tried this on few strings, to find out score, all strings are clearly english and result is deutch? Does this even work?

How can the library detect the wrong language on such simple text?

I tried the library on the following sentence: "Please delete my account."
The identified languages that I got are listed below.
English came in 5th!

BTW, I tried this with quite a few English sentences, and the results were not very good...

Am I using the library wrong?
This is what I did:

<?php
$detector = new LanguageDetection\Language();
var_export($detector->detect("Please delete my account.")->close());
?>

array (
'fr' => 0.50662139219015,
'es' => 0.50311262026033,
'ca' => 0.49643463497453,
'ro' => 0.48630447085456,
'en' => 0.48381437464629,
'ia' => 0.47747594793435,
'gl' => 0.46955291454443,
'wa' => 0.46768534238823,
'nb' => 0.45551782682513,
'it' => 0.45297113752122,
'ch' => 0.44923599320883,
'pt-PT' => 0.44069043576684,
'da' => 0.44063384267119,
'pt-BR' => 0.43848330503679,
'fy' => 0.43746462931522,
'nn' => 0.43497453310696,
'af' => 0.4343520090549,
'br' => 0.43033389926429,
'sv' => 0.42959818902094,
'ga' => 0.42716468590832,
'eo' => 0.42648556876061,
'la' => 0.4258064516129,
'io' => 0.42354272778721,
'de' => 0.42195812110922,
'et' => 0.42167515563101,
'hu' => 0.41680814940577,
'gn' => 0.41533672891907,
'to' => 0.41041312959819,
'id' => 0.41001697792869,
'tr' => 0.40288624787776,
'cy' => 0.39711375212224,
'ss' => 0.39400113186191,
'ug-Latn' => 0.39383135257499,
'gd' => 0.3933220147142,
'nl' => 0.39286926994907,
'ms-Latn' => 0.38828522920204,
'ku' => 0.38760611205433,
'sq' => 0.38664402942841,
'eu' => 0.38075834748161,
'cs' => 0.37826825127334,
'fi' => 0.37549518958687,
'wo' => 0.37006225240521,
'bi' => 0.36949632144878,
'xh' => 0.35529145444256,
'ln' => 0.35466893039049,
'sl' => 0.35336728919072,
'jv' => 0.35217883418223,
'sk' => 0.34821731748727,
've' => 0.33938879456706,
'ng' => 0.32823995472552,
'ig' => 0.32591963780419,
'pl' => 0.32382569326542,
'mt' => 0.32252405206565,
'co' => 0.32116581777023,
'lv' => 0.3184493491794,
'lt' => 0.31839275608376,
'sr-Latn' => 0.31618562535371,
'lg' => 0.31601584606678,
'tl' => 0.31494057724958,
'so' => 0.31290322580645,
'bs-Latn' => 0.30820599886814,
'hr' => 0.29881154499151,
'fj' => 0.29626485568761,
'ay' => 0.29111488398415,
'vi' => 0.28726655348048,
'az-Latn' => 0.28053197509904,
'kr' => 0.27549518958687,
'is' => 0.27464629315224,
'fo' => 0.26779852857951,
'yo' => 0.26445953593662,
'ha' => 0.26225240520656,
'mh' => 0.26140350877193,
'ty' => 0.22948500282965,
'nv' => 0.21522354272779,
'zh-Hans' => 0,
'sr-Cyrl' => 0,
'uz' => 0,
'ur' => 0,
'uk' => 0,
'ug-Arab' => 0,
'ta' => 0,
'th' => 0,
'tt' => 0,
'ab' => 0,
'sa' => 0,
'el-polyton' => 0,
'am' => 0,
'ar' => 0,
'az-Cyrl' => 0,
'be' => 0,
'bg' => 0,
'bn' => 0,
'bo' => 0,
'bs-Cyrl' => 0,
'cr' => 0,
'dz' => 0,
'el-monoton' => 0,
'fa' => 0,
'ru' => 0,
'gu' => 0,
'he' => 0,
'hi' => 0,
'hy' => 0,
'iu' => 0,
'ja' => 0,
'ka' => 0,
'km' => 0,
'ko' => 0,
'lo' => 0,
'mn-Cyrl' => 0,
'ms-Arab' => 0,
'zh-Hant' => 0,
)

Make location of the language resources configurable

It would be nice if the location of the resources would be configurable. Currently the Trainer and Language classes are bound to the resources folder. It would make managing the trainer files a bit easier.

Thanks :)

Deprecation notice with PHP 8.1

In PHP 8.1 a lot of interfaces got updated with new return types. That includes ArrayAccess. Therefore LanguageResult produces some little notices on PHP 8.1:

Deprecated: Return type of LanguageDetection\LanguageResult::offsetExists($offset) should either be compatible with ArrayAccess::offsetExists(mixed $offset): bool, or the #[\ReturnTypeWillChange] attribute should be used to temporarily suppress the notice in /app/vendor/patrickschur/language-detection/src/LanguageDetection/LanguageResult.php on line 37
Deprecated: Return type of LanguageDetection\LanguageResult::offsetGet($offset) should either be compatible with ArrayAccess::offsetGet(mixed $offset): mixed, or the #[\ReturnTypeWillChange] attribute should be used to temporarily suppress the notice in /app/vendor/patrickschur/language-detection/src/LanguageDetection/LanguageResult.php on line 46
Deprecated: Return type of LanguageDetection\LanguageResult::offsetSet($offset, $value) should either be compatible with ArrayAccess::offsetSet(mixed $offset, mixed $value): void, or the #[\ReturnTypeWillChange] attribute should be used to temporarily suppress the notice in /app/vendor/patrickschur/language-detection/src/LanguageDetection/LanguageResult.php on line 56
Deprecated: Return type of LanguageDetection\LanguageResult::offsetUnset($offset) should either be compatible with ArrayAccess::offsetUnset(mixed $offset): void, or the #[\ReturnTypeWillChange] attribute should be used to temporarily suppress the notice in /app/vendor/patrickschur/language-detection/src/LanguageDetection/LanguageResult.php on line 68

version: v5.1.0
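
For reference, a minimal sketch of an ArrayAccess implementation whose signatures do not trigger these notices on PHP 8.1. The class `ScoreMap` is a hypothetical stand-in, not the library's LanguageResult. It uses the #[\ReturnTypeWillChange] attribute on offsetGet() so that the file still parses on PHP 7, where the `mixed` type is not available.

```php
<?php
// Hypothetical stand-in class for illustration only.
final class ScoreMap implements ArrayAccess
{
    private $result;

    public function __construct(array $result = [])
    {
        $this->result = $result;
    }

    public function offsetExists($offset): bool
    {
        return isset($this->result[$offset]);
    }

    #[\ReturnTypeWillChange]
    public function offsetGet($offset)
    {
        return $this->result[$offset] ?? null;
    }

    public function offsetSet($offset, $value): void
    {
        if ($offset === null) {
            $this->result[] = $value;
        } else {
            $this->result[$offset] = $value;
        }
    }

    public function offsetUnset($offset): void
    {
        unset($this->result[$offset]);
    }
}

$map = new ScoreMap(['de' => 0.6623339658444]);
$hasGerman = isset($map['de']); // true
$missing = $map['xy'];          // null
```

Once PHP 7 support is dropped, the attribute can be replaced by a real `mixed` return type.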

Issue with detection of English text

Phrase : A beautiful villa in eastern Sweden --- Language Detected : sv

Phrase : Uma bela casa no leste da Suécia --- Language Detected : pt-BR
Phrase : Eine schöne Villa in Ost-Schweden --- Language Detected : de
Phrase : En vacker villa i östra sverige --- Language Detected : sv
Phrase : পূর্ব সুইডেনে একটি সুন্দর বাগানবাড়ি --- Language Detected : bn
Phrase : Une belle villa en Suède orientale --- Language Detected : it
Phrase : فيلا جميلة في شرق السويد --- Language Detected : ar
Phrase : En vakker villa i Øst-Sverige --- Language Detected : sv
Phrase : Una hermosa villa en el este de Suecia --- Language Detected : es
Phrase : En smuk villa i det østlige Sverige --- Language Detected : dk
Phrase : Kaunis huvila Itä Ruotsissa --- Language Detected : fi
Phrase : पूर्वी स्वीडन में एक खूबसूरत विला --- Language Detected : hi
Phrase : Una bella villa in Svezia orientale --- Language Detected : it
Phrase : 東部スウェーデンの美しいヴィラ --- Language Detected : ja
Phrase : O vilă frumoasă în estul Suediei --- Language Detected : ro
Phrase : Isang magandang villa sa silangang Sweden --- Language Detected : jv
Phrase : Красивая вилла в восточной части Швеции --- Language Detected : ru
Phrase : Красива вілла в східній частині Швеції --- Language Detected : uk

Above are the phrases and the languages detected. The library seems to work fine overall, but it does not detect the correct language for simple English text.

What's the right way of checking whether or not the text is in a specific language?

Hi! I'd like to use this library to check whether or not a (plain) text is in English.

What I'd like to do is something like:

// THRESHOLD is a placeholder for a cutoff score chosen by the caller.
if ((new \LanguageDetection\Language(['en']))->detect($str)->close()['en'] < THRESHOLD) {
    echo 'It surely is not in English';
}

It seems to me that the returned score isn't a probability, though, because it changes based on the maxNgrams used (and is sometimes negative). This makes me think that the "threshold" solution above may not be valid.

I might be able to find the answer on my own by digging a bit more into the code but I thought that an answer from the contributors could benefit other users too. Plus, it would be good to add an explanation about what the scores are in the README.
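
If, as the question suggests, the scores are only meaningful relative to each other, one alternative to an absolute threshold is to check that the target language ranks first and leads the runner-up by some margin. The sketch below works on a plain `code => score` array such as the one `close()` returns. `isLikelyLanguage` and its `$margin` default are hypothetical, chosen for illustration; this is a heuristic, not an answer from the maintainers, and it only makes sense when several candidate languages are loaded.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of the library): true when $code has the
 * highest score and leads the second-best score by at least $margin.
 *
 * @param array<string, float> $scores language code => score
 */
function isLikelyLanguage(array $scores, string $code, float $margin = 0.05): bool
{
    if (!isset($scores[$code])) {
        return false;
    }

    arsort($scores); // highest score first, keys preserved

    $codes = array_keys($scores);
    if ($codes[0] !== $code) {
        return false;
    }

    if (count($scores) === 1) {
        return true; // no runner-up to compare against
    }

    $values = array_values($scores);

    return ($values[0] - $values[1]) >= $margin;
}

// Illustrative scores only; real values come from the detector.
$scores = ['en' => 0.62, 'sv' => 0.55, 'de' => 0.41];
```

With real data, the array would come from something like `$ld->detect($str)->close()`, and a suitable margin would have to be tuned on sample texts.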
