
Comments (4)

Ousret commented on August 18, 2024

I have improved the language detector in v3.0 (rc1).
Still, it is not as good as a dedicated (n-gram) language detector and likely never will be.
I consider this issue to be addressed.


Ousret commented on August 18, 2024

Investigation into the natural language detection

This library, by default, extracts five chunks. Here is the character-frequency analysis debug output for each of them.
First, yes, I could reproduce your case.

What the library sees

Chunk 1

DEBUG COMMON [('o', 28), ('i', 22), ('a', 20), ('e', 19), ('r', 17), ('n', 16), ('l', 13), ('t', 9), ('v', 8), ('s', 6), ('g', 6), ('d', 5), ('p', 5), ('b', 5), ('u', 5), ('c', 5), ('m', 4), ('z', 2), ('h', 2), ('f', 1)]
FOUND [('English', 1.0), ('Dutch', 1.0), ('German', 0.95)]

Chunk 2

DEBUG COMMON [('i', 24), ('e', 23), ('o', 17), ('r', 15), ('n', 14), ('g', 11), ('a', 11), ('s', 11), ('t', 10), ('l', 5), ('c', 5), ('z', 4), ('d', 4), ('u', 4), ('m', 4), ('h', 4), ('v', 3), ('p', 3), ('b', 2), ('à', 1), ('f', 1), ('è', 1)]
FOUND [('Italian', 1.0), ('French', 0.9545), ('German', 0.9091)]

Chunk 3

DEBUG COMMON [('i', 45), ('e', 28), ('o', 27), ('n', 21), ('a', 20), ('r', 20), ('s', 17), ('l', 12), ('t', 12), ('m', 8), ('c', 7), ('d', 7), ('z', 6), ('p', 5), ('v', 5), ('b', 4), ('g', 4), ('u', 3), ('q', 2), ('h', 2), ('f', 1), ('é', 1), ('è', 1)]
FOUND [('French', 0.9565), ('Italian', 0.9565), ('Portuguese', 0.9565)]

Chunk 4

DEBUG COMMON [('e', 33), ('i', 33), ('o', 23), ('s', 22), ('a', 18), ('t', 17), ('n', 16), ('u', 11), ('r', 11), ('c', 10), ('m', 9), ('l', 8), ('d', 7), ('p', 6), ('f', 4), ('v', 3), ('q', 3), ('b', 3), ('h', 2), ('è', 1), ('z', 1)]
FOUND [('Italian', 1.0), ('French', 0.9524), ('Spanish', 0.9524)]

Chunk 5

DEBUG COMMON [('o', 31), ('e', 24), ('a', 21), ('n', 18), ('l', 17), ('r', 15), ('s', 13), ('t', 13), ('c', 10), ('i', 9), ('m', 8), ('d', 8), ('u', 6), ('b', 5), ('v', 4), ('p', 2), ('z', 2), ('f', 2), ('q', 1), ('g', 1)]
FOUND [('English', 1.0), ('Italian', 1.0), ('Spanish', 1.0)]

Why does it say that?

Based on the character frequencies below:

"English": [
        "e",
        "a",
        "t",
        "i",
        "o",
        "n",
        "s",
        "r",
        "h",
        "l",
        "d",
        "c",
        "u",
        "m",
        "f",
        "p",
        "g",
        "w",
        "y",
        "b",
        "v",
        "k",
        "x",
        "j",
        "z",
        "q",
    ],
...

"Italian": [
        "e",
        "i",
        "a",
        "o",
        "n",
        "l",
        "t",
        "r",
        "s",
        "c",
        "d",
        "u",
        "p",
        "m",
        "g",
        "v",
        "f",
        "b",
        "z",
        "h",
        "q",
        "è",
        "à",
        "k",
        "y",
        "ò",
    ],

AND

from typing import List

def characters_popularity_compare(
    language: str, ordered_characters: List[str]
) -> float:
    """
    Determine if an ordered characters list (by occurrence, from most frequent to rarest) matches a particular language.
    The result is a ratio between 0. (absolutely no correspondence) and 1. (near-perfect fit).
    Beware that this function is not strict on the match, in order to ease detection. (Meaning a close match scores 1.)
    """
    ...  # body elided

We compare our extraction against these tables, but NOT in a strict way. That is why Latin-based languages can sometimes get entangled.
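To make that concrete, here is a minimal, hypothetical sketch of how such a lax rank comparison behaves. It is NOT the library's actual implementation; the truncated frequency tables and the function lax_popularity_compare are illustrative only:

from typing import Dict, List

# Illustrative, truncated rank tables (the real library ships much
# fuller per-language lists, as excerpted above).
FREQUENCIES: Dict[str, List[str]] = {
    "English": ["e", "a", "t", "i", "o", "n", "s", "r", "h", "l"],
    "Italian": ["e", "i", "a", "o", "n", "l", "t", "r", "s", "c"],
}

def lax_popularity_compare(
    language: str, ordered_characters: List[str], tolerance: int = 4
) -> float:
    # Share of observed characters sitting within +/- `tolerance` rank
    # positions of their expected rank. Deliberately lax: a near match
    # counts fully, so related Latin-based languages all score high.
    reference = FREQUENCIES[language]
    seen = [c for c in ordered_characters if c in reference]
    if not seen:
        return 0.0
    hits = sum(
        1
        for rank, char in enumerate(ordered_characters)
        if char in reference and abs(reference.index(char) - rank) <= tolerance
    )
    return hits / len(seen)

# Chunk 1's ten most frequent letters score high for both languages:
top = ["o", "i", "a", "e", "r", "n", "l", "t", "v", "s"]
print(lax_popularity_compare("English", top))  # ~0.89
print(lax_popularity_compare("Italian", top))  # 1.0

Both scores would pass a typical acceptance threshold, which is exactly the entanglement described above.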

The main goal of charset-normalizer is still to offer you the best-suited character encoding.
Natural language detection is a secondary aspect. Still, we may need to find some non-breaking way to improve it.

Being stricter on natural language detection works against our main goal (in most cases).
You may argue that our natural language detection is more inclined toward detecting intelligent design first. (I would agree.)

What can we do?

My first idea going forward is to patch the function characters_popularity_compare to be a bit less lax.
Or switch to using n-grams, though I am less confident about the performance outcome.
Or improve the function merge_coherence_ratios so that it better ranks the most probable language first.

What can you do immediately?

Use a dedicated natural-language (n-gram) detector, even if it slows down your process; a sketch of that workaround follows below. I infer that you use the detected language to rename the file to the proper LG.srt, where LG is the two-letter ISO language code.
Sharing the complete dataset would also help a lot.
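An illustrative sketch of that workaround, assuming the third-party langdetect package and the naming scheme inferred above (rename_with_language is a hypothetical helper):

from pathlib import Path

from charset_normalizer import from_path  # solves the encoding first
from langdetect import detect             # dedicated n-gram language detector

def rename_with_language(srt_file: str) -> None:
    # Let charset-normalizer decode the payload, then let the n-gram
    # detector pick the ISO 639-1 code (e.g. "it" for Italian).
    best_match = from_path(srt_file).best()
    if best_match is None:
        return
    code = detect(str(best_match))
    path = Path(srt_file)
    path.rename(path.with_suffix(f".{code}.srt"))

rename_with_language("some_movie.srt")  # -> some_movie.it.srt, for example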

When?

I am not confident tweaking this right now; I must do some thorough thinking and planning first. Contributions are welcome, though.

Hope that explains things.


Ousret commented on August 18, 2024

Could you try the branch patch-lg-detect-hotfix against your dataset of SRT files?
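For reference, a quick way to test it could look like this (the repository URL is an assumption, and "subtitle.srt" is a placeholder):

# Install the branch first, e.g.:
#   pip install git+https://github.com/Ousret/charset_normalizer.git@patch-lg-detect-hotfix
from charset_normalizer import from_path

match = from_path("subtitle.srt").best()
if match is not None:
    print(match.encoding, match.language)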


cdelledonne commented on August 18, 2024

Thank you so much for the very thorough explanation, I appreciate that.

I tried your new branch and it does produce a correct result for the subtitle file that I shared, but the results for my database of subtitles are unfortunately all over the place. Here's my complete database of subtitles; I hope it can be of some help:
subsdb.zip

Most subtitles have a two-letter language code in their name, so it's easy to verify this library's correctness against those files. Other files don't have a language code, and as you guessed, I was hoping to use this library to rename them.
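For instance, a minimal sketch of such a check (the directory name, filename scheme, and code-to-name mapping are assumptions):

from pathlib import Path

from charset_normalizer import from_path

# Assumed mapping from filename codes to the names the library reports.
ISO_NAMES = {
    "en": "English", "it": "Italian", "fr": "French", "de": "German",
    "es": "Spanish", "nl": "Dutch", "pt": "Portuguese",
}

correct = total = 0
for srt in Path("subsdb").glob("*.srt"):
    parts = srt.name.split(".")
    if len(parts) < 3 or parts[-2] not in ISO_NAMES:
        continue  # no recognizable language code in the filename
    best_match = from_path(srt).best()
    total += 1
    correct += int(best_match is not None and best_match.language == ISO_NAMES[parts[-2]])
print(f"{correct}/{total} files matched their filename language code")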
