Firstly: THANK YOU! Not least for attempting the daunting task of multi-language detection. I have great hopes for this crate!
To give some context: say one is creating an Elasticsearch index (my case), and a proportion of the documents contain mixed languages. Detecting the language of fragments of lines or paragraphs is then critically important: if you index French text with an English stemmer analyser, you risk building a very defective index, so in that case you are likely to opt not to use a stemmer analyser at all...
Here is an example of output from multi-language detection:
The text:
|Table of Contents
plex knowledge centre1
2022-01-09
plex knowledge centre
- various other locations for Plex info: IT Diary 2019-12, IT Diary 2022-01
- at the current time I was able to install Plex in W10 fairly easily. There is a bit of confusion over my Plex account: email is [email protected]. But on my phone (after installing "Plex TV") I entered forgon34 for the pwd. But I think I used f*******9f when setting up on Linux.
installing Plex on Linux (from IT Diary 2022-01)
- from here
chris@M17A:~$ sudo systemctl status plexmediaserver
[sudo] password for chris: [pwd]
|
These are the detection results (languages were English, French, German, Spanish, Latin, Irish):
DetectionResult { start_index: 0, end_index: 76, word_count: 2, language: French } |Table of Contents
plex knowledge centre1
2022-01-09
plex knowledge centre
- |
DetectionResult { start_index: 76, end_index: 277, word_count: 2, language: English } |various other locations for Plex info: IT Diary 2019-12, IT Diary 2022-01
- at the current time I was able to install Plex in W10 fairly easily. There is a bit of confusion over my Plex account: email |
DetectionResult { start_index: 277, end_index: 317, word_count: 2, language: Latin } |is [email protected]. But on my phone |
DetectionResult { start_index: 317, end_index: 423, word_count: 2, language: English } |(after installing "Plex TV") I entered forgon34 for the pwd. But I think I used f*******9f when setting |
DetectionResult { start_index: 423, end_index: 470, word_count: 2, language: French } |up on Linux.
installing Plex on Linux (from IT |
DetectionResult { start_index: 470, end_index: 497, word_count: 5, language: English } |Diary 2022-01)
- from here
|
DetectionResult { start_index: 497, end_index: 592, word_count: 2, language: Latin } |chris@M17A:~$ sudo systemctl status plexmediaserver
[sudo] password for chris: [pwd]
|
Obviously this is an analysis of what you might call "jottings", not proper sentences, and indeed "jottings of a technical jargony IT-language kind". But even so, I think there should be some way of "giving up the attempt" and saying "can't make head or tail of this part of your text: REJECT".
Perhaps more importantly, as it currently stands the multi-language detection part of your crate doesn't deliver confidence levels in its `DetectionResult`s. Without such levels it is difficult to put these results to any practical use. I can of course subject each of these `DetectionResult` text fragments to a second analysis, for confidence, using the same `LanguageDetector` with the `Language` value from `detection_result.language()`. This actually makes the above results much more manageable: unsurprisingly, the supposed French and Latin fragments above turn out to have very low confidence, while the English ones score higher.
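For what it's worth, here is a self-contained sketch of that second pass. The `Fragment` struct and the `confidence_of` closure are hypothetical stand-ins for lingua's `DetectionResult` and `LanguageDetector::compute_language_confidence`; only the overall shape mirrors what I'm doing:

```rust
// Stand-in for lingua's DetectionResult: just the detected text span
// and the language the first pass assigned to it.
struct Fragment {
    text: String,
    language: &'static str, // stand-in for lingua::Language
}

/// Second pass: score each fragment against the language the first pass
/// assigned it, using any `confidence_of(text, language) -> f64` function
/// (in practice, the detector's own confidence computation).
fn score_fragments<F>(
    fragments: &[Fragment],
    confidence_of: F,
) -> Vec<(String, &'static str, f64)>
where
    F: Fn(&str, &str) -> f64,
{
    fragments
        .iter()
        .map(|f| (f.text.clone(), f.language, confidence_of(&f.text, f.language)))
        .collect()
}

fn main() {
    let fragments = vec![
        Fragment { text: "Table of Contents".into(), language: "French" },
        Fragment { text: "I was able to install Plex fairly easily".into(), language: "English" },
    ];
    // Toy confidence function purely for illustration; a real one would
    // call back into the detector.
    let scored = score_fragments(&fragments, |text, lang| {
        if lang == "English" && text.contains("install") { 0.8 } else { 0.1 }
    });
    for (text, lang, conf) in &scored {
        println!("{lang}: {conf:.2} | {text}");
    }
}
```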
So I think it would be nice to incorporate a confidence rating into `DetectionResult`. I suspect it would also help with the previous issue of "False positives with gibberish": i.e. feed in about 10 lines... the detector splits this into multiple language fragments, but all have a confidence of 0.2 or less. Conclusion: GIBBERISH detected!
Naturally all this is going to take CPU power. But by building the confidence rating in, you could probably find some optimisations compared with the two-stage analysis I describe above ...