
pemistahl / lingua-py


The most accurate natural language detection library for Python, suitable for short text and mixed-language text

License: Apache License 2.0

Python 100.00%
nlp natural-language-processing language-detection language-recognition language-identification language-classification python-library

lingua-py's Introduction

Hello, thank you for visiting my profile. 🖖🏻🤓

My name is Peter. There are actually more people than I thought who have the same first and family name, so I bother you with my middle name Michael as well. Were Type O Negative really so popular back then? I haven't got a clue...

I hold a Master's degree in computational linguistics from Saarland University in Saarbrücken, Germany. After my graduation in 2013, I decided against a research career because I like building things that help people now and not in the unforeseeable future.

Currently, I work for Riege, a leading provider of cloud-based software for the logistics industry. In my free time, I like working on open source projects in the fields of computational linguistics and string processing in general.

I have a special interest in modern programming languages and green computing. I believe that the software industry should make more significant contributions towards environmental protection. Great advances have been made to decrease energy consumption and emissions of hardware. However, those are often canceled out by poorly optimized software and resource-intensive runtime environments.

This is why I'm especially interested in the Rust programming language which allows writing performant and memory-safe applications without the need for a garbage collector or a virtual runtime environment, making use of modern syntax abstractions at the same time.

For those of you interested in how Rust and related technology can accomplish the goal of more eco-friendly software, I strongly recommend reading the dissertation Energyware Engineering: Techniques and Tools for Green Software Development, published in 2018 by Rui Pereira at the University of Minho in Portugal.


lingua-py's People

Contributors

alex-kopylov, bscan, dependabot[bot], marco-c, pemistahl


lingua-py's Issues

Add absolute confidence metric

Is it possible to get the non-relative confidence score for predictions? I want to identify the language of some text I have and filter the results based on the confidence score, say only considering the translation if the confidence is over 0.99, but I can't figure out how to get this raw number.
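For what it's worth, a minimal thresholding sketch, assuming compute_language_confidence is available in the installed version (whether its value is relative or truly absolute depends on the release); the 0.99 threshold and the helper name are illustrative only:

from lingua import LanguageDetectorBuilder

detector = LanguageDetectorBuilder.from_all_languages().build()

def detect_if_confident(text, threshold=0.99):
    # Pick the most likely language first.
    language = detector.detect_language_of(text)
    if language is None:
        return None
    # Then query the confidence for that specific language and
    # discard the prediction if it falls below the threshold.
    confidence = detector.compute_language_confidence(text, language)
    return language if confidence >= threshold else None

print(detect_if_confident("languages are awesome"))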

detect_multiple_languages_of predicts incorrect languages

Using version 1.3.1

Using a text that is in the Catalan language only, that does not contain any fragments from other languages, and that is a very standard kind of text, the detect_multiple_languages_of method detects: CATALAN, SOMALI, LATIN, FRENCH, SPANISH and PORTUGUESE. The expectation is that it should report that the full text is CATALAN.

Code to reproduce the problem:

from lingua import Language, LanguageDetectorBuilder, IsoCode639_1

with open('text-catalan.txt') as fh:
    text = fh.read()

    detector = LanguageDetectorBuilder.from_all_languages().build()
    
    for result in detector.detect_multiple_languages_of(text):
        print(f"{result.language.name}")

Also related to this problem: detect_language_of and detect_multiple_languages_of predict different languages for the same text. Below is an example where, on the same input, detect_language_of predicts Catalan and detect_multiple_languages_of predicts Tsonga.

My expectation is that both methods will predict the same given the same input.

Code sample:

from lingua import Language, LanguageDetectorBuilder, IsoCode639_1

with open('china.txt') as fh:
    text = fh.read()

    detector = LanguageDetectorBuilder.from_all_languages().build()
      
    result = detector.detect_language_of(text)
    print(f"detect_language_of prediction: {result}")
    
    for result in detector.detect_multiple_languages_of(text):
        print(f"detect_multiple_languages_of prediction: {result.language.name}")

Decrease memory and resource usage?

Hi, your module works very well to detect the language, however it uses a lot of RAM and CPU power. Is there any way to reduce the memory usage and improve performance? I need to detect among all available languages, so I can't decrease the number of languages to load.
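One thing that may help, assuming the builder's with_low_accuracy_mode option is available in the installed version, is trading some accuracy for a much smaller memory footprint. A sketch, not a definitive fix:

from lingua import LanguageDetectorBuilder

# Low accuracy mode loads smaller language models, which noticeably
# reduces memory consumption at the cost of some accuracy on short texts.
detector = (
    LanguageDetectorBuilder.from_all_languages()
    .with_low_accuracy_mode()
    .build()
)

print(detector.detect_language_of("languages are awesome"))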

Support ONNX format for language models

Hi! I'm reaching out to kindly request the availability of the ONNX file for the language detector model that is currently being utilized within the project. Thanks.

get_language_of takes too long

I am profiling my code and find that detect_language_of takes too long. Maybe the library could offer a flag, for example fast=True, to skip the internal logic for detecting two languages and just detect one and return it? It would be helpful for users that need fast analysis.

Alternatively: I analyse huge texts. Maybe someone has experience with analysing just a piece of the text to speed things up?
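As a possible workaround (a sketch only, not a library feature), analysing just a slice of a huge text is often enough when the text is monolingual; the max_chars value and helper name are illustrative:

from lingua import LanguageDetectorBuilder

detector = LanguageDetectorBuilder.from_all_languages().build()

def detect_fast(text, max_chars=2000):
    # For long monolingual documents, a prefix is usually
    # representative enough for language identification.
    return detector.detect_language_of(text[:max_chars])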

Incorrect ISO 639-3 code for Urdu

Hi, there is a typo in lingua/language.py at line 378:

URDU = (70, IsoCode639_1.UR, IsoCode639_3.UKR, frozenset([_Alphabet.ARABIC]))
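Presumably the intended value is IsoCode639_3.URD, the ISO 639-3 code for Urdu, since UKR denotes Ukrainian; the corrected line would then read:

URDU = (70, IsoCode639_1.UR, IsoCode639_3.URD, frozenset([_Alphabet.ARABIC]))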

ZeroDivisionError: float division by zero

On occasion, on longer texts, I am getting this error. Steps to reproduce:

detector.detect_language_of(text)

Where text is

Flagged as potential abuser? No Retailer | Concept-store() Brand order:  placed on  Payout scheduled date: Not Scheduled Submission type: Lead How did you initially connected?: Sales rep When did you last reach out?:  (UTC) Did you add this person through ?: I don't know Additional information: Bonjour, Je travaille avec cette boutique depuis plusieurs années. C'est moi qui lui ai conseillé de passer par pour son réassort avec le lien direct que je lui avais transmis. Pourriez vous retirer la commission de 23% ? Je vous remercie. En lien pour preuve la dernière facture que je lui ai éditée et qui date du mois dernier. De plus, j'ai redirigé vers plusieurs autres boutiques avec qui j'ai l'habitude de travailler. Elles devraient passer commande prochainement: Ça m'ennuierai de me retrouver avec le même problème pour ces clients aussi. Merci d'avance pour votre aide ! Cordialement Click here to check out customer uploaded file Click here to approve / reject / flag as potential abuser

It's not an isolated example

Any help would be massively appreciated

Weird issues with short texts in Russian

Hi team, great library! Wanted to share an example I stumbled upon, when detecting the language of a very short basic Russian text. It comes out as Macedonian, even though as far as I can tell it's not actually correct Macedonian but is correct Russian. It is identified correctly by AWS Comprehend and other APIs:

detector = LanguageDetectorBuilder.from_all_languages().build()
detector.detect_language_of("как дела")
Language.MACEDONIAN

Distinguish between different variations of the same language

Hi, I'm wondering whether it is possible for lingua to distinguish between variations of the same language, for example: Simplified Chinese and Traditional Chinese, Norwegian Bokmål and Norwegian Nynorsk.
AFAIK, langdetect could distinguish between Simplified and Traditional Chinese while other alternatives can't.

Is it possible to detect only English using lingua?

Hi, I'm currently working on a project which requires me to filter out all non-English text. It consists mostly of short texts, most of them in English. I thought of building the language detector with only Language.ENGLISH but got an error that at least two languages are required. I do not care about knowing what language each non-English text is actually in, only English / non-English. What would be the correct way to go about it with lingua? I think it might be problematic if I set it to recognize all languages, because that might just add unnecessary noise to the prediction, which should have a bias towards English in my case.
Thanks!
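A minimal sketch of one way to approach this, assuming compute_language_confidence is available: build the detector with English plus a small set of plausible competitors and threshold the English confidence. The language set, threshold and helper name are illustrative assumptions:

from lingua import Language, LanguageDetectorBuilder

# English plus a handful of likely competitors; the builder requires
# at least two languages.
languages = [Language.ENGLISH, Language.FRENCH, Language.GERMAN, Language.SPANISH]
detector = LanguageDetectorBuilder.from_languages(*languages).build()

def is_english(text, threshold=0.5):
    confidence = detector.compute_language_confidence(text, Language.ENGLISH)
    return confidence >= threshold

print(is_english("This is clearly English text."))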

Make the library compatible with Python versions < 3.9

Hello, I am trying to use the module on Google Colab and I get this error during installation:

ERROR: Could not find a version that satisfies the requirement lingua-language-detector (from versions: none)
ERROR: No matching distribution found for lingua-language-detector

What are the requirements of this module?

Wrong probabilities

First of all, thank you for Lingua!

I've come across a few cases where Lingua now detects a different language than it did previously. I think the recent changes have introduced a bug: In some cases, _look_up_ngram_probability returns a probability even though the ngram is not present in the corresponding ngram table.

Test script:

from lingua import Language, LanguageDetectorBuilder
languages = [Language.ENGLISH, Language.GERMAN, Language.SPANISH]
detector = LanguageDetectorBuilder.from_languages(*languages).build()
print(detector._look_up_ngram_probability(Language.GERMAN, "teí"))
print(detector._look_up_ngram_probability(Language.ENGLISH, "teí"))
print(detector._look_up_ngram_probability(Language.SPANISH, "teí"))

Output before the August changes:

0.0
0.0
-8.193094053015134

Output after the August changes:

-14.266
-12.05
-8.195

No module named 'regex._regex'

When deploying as a Layer to AWS Lambda and running a test, I receive this error:

[ERROR] Runtime.ImportModuleError: Unable to import module 'lambda_function': No module named 'regex._regex'

Runtime: python3.9
Architecture: x86_64

I tried with regex version 2022.10.31 and with the latest version 2023.3.23 as well. Everything runs normally on my local machine running Python 3.9.

Luxembourgish

Would it be possible to include Luxembourgish?

I believe 2 EU languages are missing from the list: Maltese and Luxembourgish.

It seems Thierry Goeckel has already built luxdetect, so maybe we can integrate it?
https://github.com/rotzbouw/luxdetect

Would be happy to discuss how to go forward and help out.

Language filtering causes wrong results

Hi, I think the language filtering that takes place before the n-grams are checked works too aggressively. I've made the observation that one non-German character is sufficient for Lingua to dismiss German as a possible language. Here are a few examples:

Vandalismus in Rotenburg: Bürger unterstützen Cafébesitzer
Barça-Fans feiern fünften Saisonsieg
Führung der César-Akademie zieht sich zurück
Ein gut gekühlter Roséwein
Flüchtlingsreferendum in Ungarn: Eigentor für Orbán
Charité-Beschäftigte streikten schon mehrfach
DFB: Fünf Clásico-Erkenntnisse für Bundestrainer Joachim Löw
Der Eröffnungstag des Sónar-Festivals für elektronische Musik gehörte den Instrumentalkünstlern

Bad detection in common word

Hello, I need to detect the language of user-generated content for a chat. I have tested this library, but it gives strange results on short texts, for example the word hello:

from lingua import Language, LanguageDetectorBuilder

languages = [Language.ENGLISH, Language.FRENCH, Language.GERMAN, Language.SPANISH]
detector = LanguageDetectorBuilder.from_languages(*languages).build()

text = """
Hello
"""
confidence_values = detector.compute_language_confidence_values(text.strip())
for language, value in confidence_values:
    print(f"{language.name}: {value:.2f}")

It returns Spanish (but the correct language is English):

SPANISH: 1.00
ENGLISH: 0.95
FRENCH: 0.87
GERMAN: 0.82

Do you have any tips for getting better results when detecting the language of user-generated content?
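One knob that may help with very short user-generated texts, assuming with_minimum_relative_distance is available in the installed version, is requiring a minimum gap between the top candidates so that unreliable predictions come back as None; the 0.9 value is illustrative:

from lingua import Language, LanguageDetectorBuilder

languages = [Language.ENGLISH, Language.FRENCH, Language.GERMAN, Language.SPANISH]
detector = (
    LanguageDetectorBuilder.from_languages(*languages)
    .with_minimum_relative_distance(0.9)
    .build()
)

# For ambiguous short inputs like "Hello" this is more likely to
# return None than a confidently wrong language.
print(detector.detect_language_of("Hello"))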

Failed to predict correct language for popular English single words

Hello

  • "ITALIAN": 0.9900000000000001,
  • "SPANISH": 0.8457074930316446,
  • "ENGLISH": 0.6405700388041755,
  • "FRENCH": 0.260556921899765,
  • "GERMAN": 0.01,
  • "CHINESE": 0,
  • "RUSSIAN": 0

Bye

  • "FRENCH": 0.9899999999999999,
  • "ENGLISH": 0.9062076381164255,
  • "GERMAN": 0.6259792361883574,
  • "SPANISH": 0.46755135335558035,
  • "ITALIAN": 0.01,
  • "CHINESE": 0,
  • "RUSSIAN": 0

Loss (not Löss)

  • "GERMAN": 0.99,
  • "ENGLISH": 0.9177028091362562,
  • "ITALIAN": 0.9082690119891484,
  • "FRENCH": 0.7091301303929289,
  • "SPANISH": 0.01,
  • "CHINESE": 0,
  • "RUSSIAN": 0

Improve single language detection when words in other languages are quoted

When I put in German sentences with quoted Japanese words, it can happen that lingua claims the text is 100% Japanese.
For example:
Wir stoßen an: "かんぱい". Er lächelte. (in English, if you are interested: »We toasted: "kanpai". He smiled«) leads to a ConfidenceValue of 1.0 for Japanese, while Wir stoßen an. Er lächelte. has a ConfidenceValue of 0.6014287047855706 for German and 0.0 for Japanese (I included all languages for detection).

The expected result in both cases should be German, maybe with a slight Japanese confidence in the first case since a Japanese word is quoted, but it should not be 100% Japanese.

Proposition: Add confidence value to the output of method Detector.detect_language_of

Hi, is it possible to add the confidence value to the output of the method Detector.detect_language_of(text)?
Currently I'm obtaining the confidence (assuming the returned language is not None) by additionally calling the method Detector.compute_language_confidence(text, language), even though the confidence has already been computed by the previous method.

'compute_language_confidence_values' probabilities do not sum to 1

I ran a sample

from lingua import Language, LanguageDetectorBuilder
languages = [Language.ENGLISH, Language.FRENCH, Language.GERMAN, Language.SPANISH]
detector = LanguageDetectorBuilder.from_languages(*languages).build()
confidence_values = detector.compute_language_confidence_values("Cereal Churros Sabor A Canela Kellogg´S 260 Gr")
for language, value in confidence_values:
    print(f"{language.name}: {value:.2f}")

And the output is

SPANISH: 1.00
ENGLISH: 0.96
GERMAN: 0.87
FRENCH: 0.86

The documentation explains that the probabilities will sum to 1, which makes sense to me. But here it seems that a binary classification is done for each language and the languages are ranked by that binary classification probability. Is there a bug or something?

Also, if I have fewer languages to classify against, does that make the results more accurate?

Multiple Languages

Hi! Thanks a lot for your "lingua"!

Could you please test it:

English language Английский язык

and

English language - Английский язык

?


My code is:

from lingua import Language, LanguageDetectorBuilder
languages = [Language.ENGLISH, Language.RUSSIAN]
detector = LanguageDetectorBuilder.from_languages(*languages).build()
sentence = '%text_from_memo%'
for result in detector.detect_multiple_languages_of(sentence):
    print(f"{result.language.name} {sentence[result.start_index:result.end_index]}")

But I'm on Delphi 11 now (+ Python 3.10.9), so I'm not sure who is the source of the problem :)

Strategy to split text by language

Hello,

I would like to split multilingual texts by language.

I have a database of emails, some of which are multilingual. There are some 40 languages and, a priori, one does not know which languages may be present in each email.

Only three languages are significant enough to warrant a machine learning approach.

The following is relatively accurate, but I would like to know if there would be a better strategy to optimise accuracy and/or speed:

from lingua import Language, LanguageDetectorBuilder
from collections import defaultdict

main_langs = ['ENGLISH', 'FRENCH', 'GERMAN']

sentence = "Parlez-vous français? " + \
			"Ich spreche Französisch nur ein bisschen. " + \
			"Desde luego, no cabe ninguna duda. " + \
			"Oui, merci. Je vous en prie monsieur. " + \
			"A little bit is better than nothing. That is wonderful, isn't it. " + \
			"Indeed, I completely agree. " + \
			"Acho que sim, muito obrigado. " + \
			"Ja, das ist wunderbar. Danke." + \
			"To summarise, this is complete nonsense. "
			
detector_global = LanguageDetectorBuilder.from_all_spoken_languages().build()

languages = []
for result in detector_global.detect_multiple_languages_of(sentence):
	languages.append(result.language)

languages = list(set(languages))

detector_local = LanguageDetectorBuilder.from_languages(*languages).build()

lang_dict = defaultdict(list)
for result in detector_local.detect_multiple_languages_of(sentence):
	lang = result.language.name
	text = sentence[result.start_index:result.end_index]
	if lang in main_langs:
		lang_dict[lang].append(text)
	else:
		lang_dict['OTHER'].append(text)

print(lang_dict)

This gives:

{'FRENCH': ['Parlez-vous français? '], 'GERMAN': ['Ich spreche Französisch nur ein bisschen. ', 'ist wunderbar. Danke.To '], 'OTHER': ['Desde luego, no cabe ninguna duda. Oui, merci. Je vous ', 'en prie monsieur. A ', 'Acho que ', 'sim, muito obrigado. Ja, das '], 'ENGLISH': ["little bit is better than nothing. That is wonderful, isn't it. Indeed, I completely agree. ", 'summarise, this is complete nonsense. ']}

Which is not perfect, but it may be fair enough considering the level of difficulty of this exercise (shuffled multilingual short sentences).

The strategy here is as follows:

  1. First, brute-force language detection.
  2. Followed by text classification with the subset of languages detected in step one.

This seems to work somewhat better than a single brute-force step (would that make sense?) and significantly better than a targeted approach where the only models considered are English, French and German.

Any thoughts or ideas?

Please provide performance metrics in the benchmarks

I'm impressed by the accuracy of Lingua as compared to even fasttext, but it would be very useful to also see performance metrics in the benchmarks to determine if that accuracy comes at a cost. Likewise it would be useful for comparing lingua's low and high accuracy modes.

Speed question

Hello,

I have not tested the package yet, but I am interested in a speed comparison vs. CLD3 or langid (Python and C versions).

Did you benchmark from that perspective?

Thanks.

Detect multiple languages in mixed-language text

Currently, for a given input string, only the most likely language is returned. However, if the input contains contiguous sections in multiple languages, it would be desirable to detect all of them and return an ordered sequence of items, where each item consists of a start index, an end index and the detected language.

Input:
He turned around and asked: "Entschuldigen Sie, sprechen Sie Deutsch?"

Output:

[
  {"start": 0, "end": 27, "language": ENGLISH}, 
  {"start": 28, "end": 69, "language": GERMAN}
]
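This feature has since been implemented as detect_multiple_languages_of, which returns DetectionResult objects carrying start_index, end_index and language. A short usage sketch:

from lingua import Language, LanguageDetectorBuilder

detector = LanguageDetectorBuilder.from_languages(Language.ENGLISH, Language.GERMAN).build()
text = 'He turned around and asked: "Entschuldigen Sie, sprechen Sie Deutsch?"'

for result in detector.detect_multiple_languages_of(text):
    # Each DetectionResult carries the detected language and the
    # start/end indices of the corresponding substring.
    print(result.language.name, text[result.start_index:result.end_index])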

Returning IsoCode639_1

Hello, is there any option to make lingua return the language as its IsoCode639_1 instead of Language.ENGLISH? In the API documentation I have found a class for it, but no further information or example of how to use it.

Thank you
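A short sketch of how the ISO code can be read from a detected language, assuming the result is not None:

from lingua import Language, LanguageDetectorBuilder

detector = LanguageDetectorBuilder.from_languages(Language.ENGLISH, Language.GERMAN).build()

language = detector.detect_language_of("languages are awesome")
if language is not None:
    # Prints "EN" for English; iso_code_639_3 is available as well.
    print(language.iso_code_639_1.name)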

Caught an IndexError while using detect_multiple_languages_of

On the test_case:

, Ресторан «ТИНАТИН»

Code fell down with an error:

Traceback (most recent call last):
  File "/home/essential/PycharmProjects/pythonProject/test_unnest.py", line 363, in <module>
    for lang, sentence in detector.detect_multiple_languages_of(text)
  File "/home/essential/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/lingua/detector.py", line 389, in detect_multiple_languages_of
    _merge_adjacent_results(results, mergeable_result_indices)
  File "/home/essential/PycharmProjects/pythonProject/venv/lib/python3.10/site-packages/lingua/detector.py", line 114, in _merge_adjacent_results
    end_index=results[i + 1].end_index,
IndexError: list index out of range

Code example:

from lingua import Language, LanguageDetectorBuilder

languages = [Language.ENGLISH, Language.RUSSIAN, Language.UKRAINIAN]
detector = LanguageDetectorBuilder.from_languages(*languages).build()
text = ', Ресторан «ТИНАТИН»'
sentences = [(lang, sentence) for lang, sentence in detector.detect_multiple_languages_of(text)]

Generative models/LLMs in the benchmarks

Would be interesting to have "prompt + LLM" accuracy in the benchmark as well. A simple prompt to GPT4 and restricting the output with LMQL should be quite straightforward.

Proposition: Using prior language probability to increase likelihood

@pemistahl Peter, I think it would be beneficial for this library to have a separate method that will add probability prior (in a Bayesian way) to the mix.

Let's look into statistics: https://en.wikipedia.org/wiki/Languages_used_on_the_Internet

So if 57% of the texts you see on the internet are in English, then if you predicted "English" for any input you would be wrong only 43% of the time. It's like a stopped clock, except it is right on every second probe.

For example: #100

Based on that premise, if we use just plain character statistics, "как дела" is more Macedonian than Russian. But overall, if we add language statistics to the mix, lingua-py would be "wrong" less often.

There are more Russian-speaking users of this library, than Macedonians, just because there are more Russian-speaking people overall. And so when a random user writes "как дела" it's "more accurate" to predict "russian" than "macedonian", just because in general that is what is expected by these users.

So my proposition is to add a detector.detect_language_with_prior function and factor in the prior in a Bayesian way: likelihood = probability × prior_probability

For example: #97

detector.detect_language_of("Hello")

"ITALIAN": 0.9900000000000001,
"SPANISH": 0.8457074930316446,
"ENGLISH": 0.6405700388041755,
"FRENCH": 0.260556921899765,
"GERMAN": 0.01,
"CHINESE": 0,
"RUSSIAN": 0
detector.detect_language_with_prior("Hello")

# Of course, the constants are for illustrative purposes only.
# Results should be normalized afterwards.
"ENGLISH": 0.6405700388041755 * 0.577,
"SPANISH": 0.8457074930316446 * 0.045,
"ITALIAN": 0.9900000000000001 * 0.017,
"FRENCH": 0.260556921899765 * 0.039,

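A minimal sketch of how such a prior could be layered on top of the existing API without changing the library, assuming compute_language_confidence_values is available; the prior values and the helper name are illustrative only:

from lingua import Language, LanguageDetectorBuilder

# Hypothetical priors, e.g. derived from web-content statistics.
PRIORS = {
    Language.ENGLISH: 0.577,
    Language.SPANISH: 0.045,
    Language.FRENCH: 0.039,
    Language.ITALIAN: 0.017,
}

detector = LanguageDetectorBuilder.from_languages(*PRIORS.keys()).build()

def detect_language_with_prior(text):
    # Weight each confidence by its prior and renormalize.
    weighted = {}
    for language, value in detector.compute_language_confidence_values(text):
        weighted[language] = value * PRIORS.get(language, 0.0)
    total = sum(weighted.values()) or 1.0
    return {language: value / total for language, value in weighted.items()}

print(detect_language_with_prior("Hello"))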

Any speed benchmarks

Thank you for the nice library and detailed accuracy benchmarks. Do you have a plan for a similar benchmarking for speed?

Chinese breaks multi language detection

Hello,
it looks like Chinese in a text breaks multi-language detection. I know it's experimental, but most of the time it works pretty well.
Example:
from lingua import Language, LanguageDetectorBuilder

text = "Płaszczowo-rurowe wymienniki ciepła Uszczelkowe der blaue himmel über berlin 中文 the quick brown fox jumps over the lazy dog"
detector = LanguageDetectorBuilder.from_languages(Language.ENGLISH, Language.GERMAN, Language.POLISH).build()
detector.detect_multiple_languages_of(text)

This returns:

[DetectionResult(start_index=0, end_index=48, word_count=4, language=Language.POLISH), DetectionResult(start_index=48, end_index=77, word_count=5, language=Language.GERMAN)]

detect_multiple_languages_of is very slow

Using version 1.3.1

On a text that is 3.5K (31 lines), detect_multiple_languages_of takes 26.56 seconds on my machine while detect_language_of takes only 1.68 seconds.

26 seconds to analyse 3.5K of text (a throughput of ~7 seconds per 1K) makes the detect_multiple_languages_of method really not suitable for processing a large corpus.

Code used for the benchmark:


from lingua import Language, LanguageDetectorBuilder, IsoCode639_1
import datetime


with open('text.txt') as fh:
    text = fh.read()

    detector = LanguageDetectorBuilder.from_all_languages().build()
    
    start_time = datetime.datetime.now()
    result = detector.detect_language_of(text)
    print('Time used for detect_language_of: {0}'.format(datetime.datetime.now() - start_time))
    print(result.iso_code_639_1)

    start_time = datetime.datetime.now()    
    results = detector.detect_multiple_languages_of(text)    
    print('Time used for detect_multiple_languages_of: {0}  '.format(datetime.datetime.now() - start_time))    
    for result in results:
        print(result)
        print(f"** {result.language.name}")
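As a possible workaround (a sketch under the assumption that language switches mostly happen at line breaks, not a library feature), splitting the text into lines and running detect_language_of per line can approximate per-section detection at far lower cost:

from lingua import LanguageDetectorBuilder

detector = LanguageDetectorBuilder.from_all_languages().build()

def detect_per_line(text):
    results = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            # detect_language_of is much cheaper than detect_multiple_languages_of.
            results.append((detector.detect_language_of(line), line))
    return results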

Does not detect Hindi

from lingua import Language, LanguageDetectorBuilder
languages = [Language.ENGLISH, Language.HINDI]
detector = LanguageDetectorBuilder.from_languages(*languages).build()
confidence_values = detector.compute_language_confidence_values("Bhai aapka isme")
for language, value in confidence_values:
    print(f"{language.name}: {value:.2f}")

ENGLISH: 1.00
HINDI: 0.00

That's 100% wrong

Import of LanguageDetectorBuilder failed

When loading the LanguageDetectorBuilder as recommended in the readme, I received the following error:

from lingua import LanguageDetectorBuilder
...
ImportError: cannot import name 'LanguageDetectorBuilder' from 'lingua'

The following worked for me:

from lingua.builder import LanguageDetectorBuilder

Some character to language mappings are incorrect

Hello! My understanding is that this mapping:

CHARS_TO_LANGUAGES_MAPPING: Dict[str, FrozenSet[Language]] = {

is used by the rule system to identify languages based on characters. Is my assumption correct?

Looking at this:

"ÁáÍíÚú": frozenset(

The Catalan language, for example, does NOT have "Áá" as valid characters (see reference https://en.wikipedia.org/wiki/Catalan_orthography#Alphabet).

Looking at the data I see other mappings that do not seem right.

Might it be the case that these mappings can be improved?

Reduce memory usage?

Naive (potential) user question here. I'm looking for a good, up to date language detection library for Annif - see this issue. Lingua seems promising, but it seems to require quite a lot of memory, especially when all supported languages are considered - this is pointed out in the README. I tested detecting the language of the example sentence "languages are awesome" and it required 1.8GB of memory. When I chose to preload all models, this increased to 2.6GB.

I tested doing the same with pycld3 and langdetect and their memory usage was much much lower - too little to bother measuring accurately. I don't see anything in the README that would justify using such huge amounts of RAM compared to other implementations. Having the rules is certainly good, but I don't think they use lots of RAM.

I'm wondering if there's some trick that other language detection libraries are performing to reduce their memory requirements? Could Lingua do that too? Or is this just a tradeoff that you have to accept if you want to achieve the high accuracy? For my purposes, although it's nice to have good accuracy, this isn't a top priority. It would also help to be able to choose smaller and faster models with slightly reduced accuracy.

Error: ZeroDivisionError: float division by zero

Hello.

When running this code with lingua_language_detector version 1.3.0:

from lingua import LanguageDetectorBuilder

with open('text.txt') as fh:
    text = fh.read()
    detector = LanguageDetectorBuilder.from_all_languages().build()
    print(text)
    result = detector.detect_language_of(text)
    print(result)

I get this error:

Traceback (most recent call last):
  File "/home/jordi/sc/crux-top-lists-catalan/bug.py", line 9, in <module>
    result = detector.detect_language_of(text)
  File "/home/jordi/.local/lib/python3.10/site-packages/lingua/detector.py", line 272, in detect_language_of
    confidence_values = self.compute_language_confidence_values(text)
  File "/home/jordi/.local/lib/python3.10/site-packages/lingua/detector.py", line 499, in compute_language_confidence_values
    normalized_probability = probability / denominator
ZeroDivisionError: float division by zero

I attached the text file that triggers the problem. It works fine with other texts.
This happens often in a crawling application that I'm testing.

Transformer models for Language Detection

I've been experimenting with language detection for a few months due to the necessity of accurate language detection for a translation project, where detection of the wrong language can lead to text going down an incorrect pipeline and outputting nonsense to the individual who requested a translation. Because of this, I've been looking into language detection libraries such as lingua - but it's an incredibly complex thing to balance accuracy with latency, as you guys are well aware.
Lingua is amazing, and I thank the maintainers/developers for it, but for so many cases it isn't usable due to latency issues with detections - especially in a production environment where people expect results automatically (the downside of the internet ig).
So to solve this issue for myself, I finetuned a pre-trained AI model - amazing concept - called mT5 (I have only used the small version so far), a pre-trained model from Google that has seen over 101 languages in its unsupervised pretraining phase. It's still training right now, but early results (a day into training) show similar outputs to lingua's low accuracy mode (using lingua's 3 classes of test sets). I still need to conduct further testing, incorporating the model's execution into your accuracy reporter (thanks for that, btw).

This model, once finetuned with the Huggingface Trainer API, can be converted to the library CTranslate2, which provides outstanding support for the inference of Transformer models and which I use for my translation projects and this model. This allows utilizing the CPU for fast inference where a GPU may not be accessible (and makes optimized-CPU throughput similar to unoptimized-GPU throughput). The latency is low for what's expected of a large machine learning model pipeline - which is stated as such in the ReadMe - thanks to CTranslate2's efficiency. And it can use CPU or GPU, offering those with a GPU the ability to speed up detections even more. I need to conduct further testing regarding throughput and accuracy (I am currently continuing training, so I can't conduct accurate throughput measurements).

To sum up:

Pros

  1. Faster detection
  2. Efficient detection batching
  3. Ability to suppress detections of specific languages (suppressed_sequences in translate_batch method)
  4. Selection of GPU or CPU (as well as intra_threads and inter_threads if required)
  5. Low memory usage (295 MB model file on disk [CTranslate2 conversion])
  6. Transformer neural model, possibly able to pick up on nuances of language that statistical n-gram models may not
  7. Utilization of a pre-trained transformer model - it has seen tons of data from its 107 pretraining languages

Cons

  1. Inability to provide detection scores (the only possibility is using score_batch [this returns a per-token perplexity log score, not scores summing to 1], but some limited testing of mine found some issues)
  2. One unified model - any finetuning or adding of languages needs to finetune the entire model (meaning finetuning has to show all language data when doing so to prevent catastrophic forgetting)
  3. Possibly more utilization of computer resources (it's a 300m parameter model, so it does need 'some' resources)
  4. A ghost of a chance for the model to sometimes output sequences (inferences) that aren't a language code (a con of using a seq2seq model vs. classification models, I suppose) - I need to investigate this further, but it did not affect accuracy results at all

Neutral (couldn't choose if it's a con or a pro)

  1. Relatively low training time: for my model with support for ~97 languages (9.7M corpora total), really competitive results at the 20h training mark [RTX 3090]

Let me know if there's any interest in the results or the model; I just thought it's something that should be shared.

mT5 paper
CTranslate2 Docs
