
Comments (4)

Ousret commented on August 18, 2024

I have improved the language detector in v3.0 (rc1).
Still, it is not as good as a dedicated (n-gram) language detector and likely never will be.
I consider this issue to be addressed.


Ousret commented on August 18, 2024

Investigation into the natural language detection

This library, by default, extracts five chunks. Here is the character-frequency analysis debug output for each of them.
First, yes, I could reproduce your case.

What the library sees

Chunk 1

DEBUG COMMON [('o', 28), ('i', 22), ('a', 20), ('e', 19), ('r', 17), ('n', 16), ('l', 13), ('t', 9), ('v', 8), ('s', 6), ('g', 6), ('d', 5), ('p', 5), ('b', 5), ('u', 5), ('c', 5), ('m', 4), ('z', 2), ('h', 2), ('f', 1)]
FOUND [('English', 1.0), ('Dutch', 1.0), ('German', 0.95)]

Chunk 2

DEBUG COMMON [('i', 24), ('e', 23), ('o', 17), ('r', 15), ('n', 14), ('g', 11), ('a', 11), ('s', 11), ('t', 10), ('l', 5), ('c', 5), ('z', 4), ('d', 4), ('u', 4), ('m', 4), ('h', 4), ('v', 3), ('p', 3), ('b', 2), ('à', 1), ('f', 1), ('è', 1)]
FOUND [('Italian', 1.0), ('French', 0.9545), ('German', 0.9091)]

Chunk 3

DEBUG COMMON [('i', 45), ('e', 28), ('o', 27), ('n', 21), ('a', 20), ('r', 20), ('s', 17), ('l', 12), ('t', 12), ('m', 8), ('c', 7), ('d', 7), ('z', 6), ('p', 5), ('v', 5), ('b', 4), ('g', 4), ('u', 3), ('q', 2), ('h', 2), ('f', 1), ('é', 1), ('è', 1)]
FOUND [('French', 0.9565), ('Italian', 0.9565), ('Portuguese', 0.9565)]

Chunk 4

DEBUG COMMON [('e', 33), ('i', 33), ('o', 23), ('s', 22), ('a', 18), ('t', 17), ('n', 16), ('u', 11), ('r', 11), ('c', 10), ('m', 9), ('l', 8), ('d', 7), ('p', 6), ('f', 4), ('v', 3), ('q', 3), ('b', 3), ('h', 2), ('è', 1), ('z', 1)]
FOUND [('Italian', 1.0), ('French', 0.9524), ('Spanish', 0.9524)]

Chunk 5

DEBUG COMMON [('o', 31), ('e', 24), ('a', 21), ('n', 18), ('l', 17), ('r', 15), ('s', 13), ('t', 13), ('c', 10), ('i', 9), ('m', 8), ('d', 8), ('u', 6), ('b', 5), ('v', 4), ('p', 2), ('z', 2), ('f', 2), ('q', 1), ('g', 1)]
FOUND [('English', 1.0), ('Italian', 1.0), ('Spanish', 1.0)]

Why does it say that?

Based on the character frequencies below:

"English": [
        "e",
        "a",
        "t",
        "i",
        "o",
        "n",
        "s",
        "r",
        "h",
        "l",
        "d",
        "c",
        "u",
        "m",
        "f",
        "p",
        "g",
        "w",
        "y",
        "b",
        "v",
        "k",
        "x",
        "j",
        "z",
        "q",
    ],
...

"Italian": [
        "e",
        "i",
        "a",
        "o",
        "n",
        "l",
        "t",
        "r",
        "s",
        "c",
        "d",
        "u",
        "p",
        "m",
        "g",
        "v",
        "f",
        "b",
        "z",
        "h",
        "q",
        "è",
        "à",
        "k",
        "y",
        "ò",
    ],

AND

from typing import List

def characters_popularity_compare(
    language: str, ordered_characters: List[str]
) -> float:
    """
    Determine if an ordered characters list (by occurrence, from most frequent to rarest) matches a particular language.
    The result is a ratio between 0. (absolutely no correspondence) and 1. (near-perfect fit).
    Beware that this function is not strict on the match, in order to ease detection. (Meaning a close match scores 1.)
    """
    ...  # body elided

We compare our extraction against these tables, but NOT in a strict way. That is why Latin-based languages can sometimes get entangled.
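To make that concrete, here is a minimal, hypothetical sketch of how such a lax rank comparison behaves. It is NOT the library's actual implementation; the truncated frequency tables and the function lax_popularity_compare are illustrative only:

from typing import Dict, List

# Illustrative, truncated rank tables (the real library ships much
# fuller per-language lists, as excerpted above).
FREQUENCIES: Dict[str, List[str]] = {
    "English": ["e", "a", "t", "i", "o", "n", "s", "r", "h", "l"],
    "Italian": ["e", "i", "a", "o", "n", "l", "t", "r", "s", "c"],
}

def lax_popularity_compare(
    language: str, ordered_characters: List[str], tolerance: int = 4
) -> float:
    # Share of observed characters sitting within +/- `tolerance` rank
    # positions of their expected rank. Deliberately lax: a near match
    # counts fully, so related Latin-based languages all score high.
    reference = FREQUENCIES[language]
    seen = [c for c in ordered_characters if c in reference]
    if not seen:
        return 0.0
    hits = sum(
        1
        for rank, char in enumerate(ordered_characters)
        if char in reference and abs(reference.index(char) - rank) <= tolerance
    )
    return hits / len(seen)

# Chunk 1's ten most frequent letters score high for both languages:
top = ["o", "i", "a", "e", "r", "n", "l", "t", "v", "s"]
print(lax_popularity_compare("English", top))  # ~0.89
print(lax_popularity_compare("Italian", top))  # 1.0

Both scores would pass a typical acceptance threshold, which is exactly the entanglement described above.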

The main goal of charset-normalizer is still to offer you the best-suited character encoding.
Natural language detection is a secondary aspect. Still, we may need to find some non-breaking way to improve it.

Being stricter on natural language detection works against our main goal (in most cases).
You may argue that our natural language detection is more inclined toward detecting intelligent design first. (I would agree.)

What can we do?

My first idea going forward is to patch the function characters_popularity_compare to be a bit less lax.
Or switch to using n-grams, though I am less confident about the performance outcome.
Or improve the function merge_coherence_ratios so that it better ranks the most probable language first.

What can you do immediately?

Use a dedicated natural-language (n-gram) detector, even if it slows down your process; a sketch of that workaround follows below. I infer that you use the detected language to rename the file to the proper LG.srt, where LG is the two-letter ISO language code.
Sharing the complete dataset would also help a lot.
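An illustrative sketch of that workaround, assuming the third-party langdetect package and the naming scheme inferred above (rename_with_language is a hypothetical helper):

from pathlib import Path

from charset_normalizer import from_path  # solves the encoding first
from langdetect import detect             # dedicated n-gram language detector

def rename_with_language(srt_file: str) -> None:
    # Let charset-normalizer decode the payload, then let the n-gram
    # detector pick the ISO 639-1 code (e.g. "it" for Italian).
    best_match = from_path(srt_file).best()
    if best_match is None:
        return
    code = detect(str(best_match))
    path = Path(srt_file)
    path.rename(path.with_suffix(f".{code}.srt"))

rename_with_language("some_movie.srt")  # -> some_movie.it.srt, for example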

When?

I am not confident tweaking this right now; I must do some thorough thinking and planning first. Contributions are welcome, though.

Hope that explains things.


Ousret commented on August 18, 2024

Could you try the branch patch-lg-detect-hotfix against your dataset of SRT files?
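For reference, a quick way to test it could look like this (the repository URL is an assumption, and "subtitle.srt" is a placeholder):

# Install the branch first, e.g.:
#   pip install git+https://github.com/Ousret/charset_normalizer.git@patch-lg-detect-hotfix
from charset_normalizer import from_path

match = from_path("subtitle.srt").best()
if match is not None:
    print(match.encoding, match.language)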


cdelledonne commented on August 18, 2024

Thank you so much for the very thorough explanation, I appreciate that.

I tried your new branch and it does produce a correct result for the subtitle file that I shared, but the results for my database of subtitles are unfortunately all over the place. Here's my complete database of subtitles; I hope it can be of some help:
subsdb.zip

Most subtitles have a two-letter language code in their name, so it's easy to verify this library's correctness against those files. Other files don't have a language code, and as you guessed, I was hoping to use this library to rename them.
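For instance, a minimal sketch of such a check (the directory name, filename scheme, and code-to-name mapping are assumptions):

from pathlib import Path

from charset_normalizer import from_path

# Assumed mapping from filename codes to the names the library reports.
ISO_NAMES = {
    "en": "English", "it": "Italian", "fr": "French", "de": "German",
    "es": "Spanish", "nl": "Dutch", "pt": "Portuguese",
}

correct = total = 0
for srt in Path("subsdb").glob("*.srt"):
    parts = srt.name.split(".")
    if len(parts) < 3 or parts[-2] not in ISO_NAMES:
        continue  # no recognizable language code in the filename
    best_match = from_path(srt).best()
    total += 1
    correct += int(best_match is not None and best_match.language == ISO_NAMES[parts[-2]])
print(f"{correct}/{total} files matched their filename language code")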
