
Comments (6)

PalmtopTiger commented on July 19, 2024

Thanks for the quick response.

The language is detected correctly; no tuning of frequencies is required.

I think it is worth limiting CP1125 to Ukrainian. Sorry, but I don't have the required knowledge to do this. You should look into it when you have time.

Ousret commented on July 19, 2024

I have to say no. After much consideration, and drafts that did not work out in the end, I am going to close this.
I am willing to accept a PR, as someone else with a fresh perspective could do something about it.
What I am inclined to do is update the documentation to explain that case a bit more.

Ousret commented on July 19, 2024

Hi @PalmtopTiger

Indeed, those two code pages are very similar. The first course of action would be to verify the FREQUENCIES dict in assets/__init__.py; the Russian and Ukrainian character frequency order could need some fine-tuning.
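
For reference, a minimal way to inspect that dict (assuming FREQUENCIES maps a language name to its characters in descending frequency order, per assets/__init__.py):

from charset_normalizer.assets import FREQUENCIES

# Each entry lists a language's characters by descending frequency of use.
russian = FREQUENCIES["Russian"]
ukrainian = FREQUENCIES["Ukrainian"]

print(russian[:10])
print(ukrainian[:10])

# Letters unique to each alphabet are the real discriminating signal.
print(set(ukrainian) - set(russian))
print(set(russian) - set(ukrainian))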

If the above does not work, we would need (in addition) to limit CP1125's target language to Ukrainian.
Would you take a look? I have limited time to offer this week.

Regards,

Ousret commented on July 19, 2024

Hi @PalmtopTiger

I have started working on your case.
With all the files you gave me, the result does not differ at all whether the decoder is ibm866 or cp1125.

Even if I limited cp1125 to Ukrainian only, it would not matter at all.

    with open("./char-dataset/ibm866/CHEBUR.TXT", "rb") as fp:
        payload = fp.read()

    print(payload.decode("cp1125") == payload.decode("cp866"))  # Prints True!

Given the following code:

from charset_normalizer import from_path

if __name__ == "__main__":

    results = from_path(
        r"./char-dataset/ibm866/MGICMASK.TXT",
        explain=True,
        cp_isolation=["cp1125", "ibm866"]
    )

    print(
        str(results.best().could_be_from_charset)  # Will print ["cp1125", "ibm866"]
    )

I have limited cp1125 to target Ukrainian only.

Logs:

2021-07-31 23:54:38,609 | WARNING | cp_isolation is set. use this flag for debugging purpose. limited list of encoding allowed : cp1125, ibm866.
2021-07-31 23:54:38,623 | INFO | cp1125 passed initial chaos probing. Mean measured chaos is 0.000000 %
2021-07-31 23:54:38,623 | INFO | cp1125 should target any language(s) of ['Ukrainian']
2021-07-31 23:54:38,626 | INFO | We detected language [('Ukrainian', 0.8243)] using cp1125
2021-07-31 23:54:38,626 | INFO | Using cp1125 code page we detected the following languages: [('Ukrainian', 0.8243)]
2021-07-31 23:54:38,627 | INFO | cp866 passed initial chaos probing. Mean measured chaos is 0.000000 %
2021-07-31 23:54:38,629 | INFO | cp866 should target any language(s) of ['Russian', 'Ukrainian', 'Serbian', 'Bulgarian']
2021-07-31 23:54:38,634 | INFO | We detected language [('Russian', 0.8948), ('Bulgarian', 0.8596), ('Ukrainian', 0.8243), ('Serbian', 0.6974)] using cp866
2021-07-31 23:54:38,635 | INFO | Using cp866 code page we detected the following languages: [('Ukrainian', 0.8243)]

cp866 will be categorized as a submatch of cp1125, as the result is the same.

If the two decoders produced different results, limiting cp1125 to Ukrainian would matter. But in this case, they do not.
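
To be precise, the two code pages are not identical; they merely agree on every byte these particular files use. The differing byte values can be enumerated with the standard library alone:

# List every byte value that cp866 and cp1125 decode differently.
diff = [
    b for b in range(256)
    if bytes([b]).decode("cp866") != bytes([b]).decode("cp1125")
]

for b in diff:
    print(hex(b), bytes([b]).decode("cp866"), "->", bytes([b]).decode("cp1125"))

Any file that avoids those byte values decodes identically under both, which is the situation here.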

But, nevertheless, there is room for improvement. I have a few ideas:

  • Patch the submatch factoring and prefer the most coherent one (language detection) on top. (A rough sketch of this idea follows after the snippet below.)
  • Patch the function cd.encoding_languages to limit the returned languages for those specific charsets.
    ...Like this:
from functools import lru_cache
from typing import List, Optional

@lru_cache()
def encoding_languages(iana_name: str) -> List[str]:
    """
    Single-byte encoding language association. Some code pages are heavily linked to particular language(s).
    This function does the correspondence.
    """

    # Put overly specific code page target languages here.
    if iana_name == "cp1125":
        return ["Ukrainian"]

    # encoding_unicode_range is defined elsewhere in charset_normalizer's cd module.
    unicode_ranges = encoding_unicode_range(iana_name)  # type: List[str]
    primary_range = None  # type: Optional[str]
    ...
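
For the first bullet, a rough, hypothetical sketch (not the actual implementation) of what re-ranking equivalent matches by coherence could look like through the public API:

from charset_normalizer import from_path

results = from_path("./char-dataset/ibm866/MGICMASK.TXT")

# Hypothetical re-ranking: among matches that decode to the same text,
# prefer the one whose detected language coherence is highest.
ranked = sorted(results, key=lambda match: match.coherence, reverse=True)

for match in ranked:
    print(match.encoding, match.coherence, match.languages)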

Before I can go any further, I need to identify all the code pages that are similar to the CP1125 case.
If it is just this one code page, I don't think we can produce a PR that addresses only a single edge case.

PalmtopTiger commented on July 19, 2024

An alternative solution: change the probing order so that CP866 is checked before CP1125. Then CP1125 would be marked as an alternative to CP866.
Ideally, all encodings should be probed in order of popularity, which would speed up detection of the most popular encodings. But I did not find detailed data on the popularity of encodings, other than https://w3techs.com/technologies/overview/character_encoding, and CP866 and CP1125 do not appear there at all.
As for the fact that the encodings are the same, this is not the case. If a file contains characters that differ between these encodings and the wrong encoding is selected, the text will be damaged. I believe that CP866 is more common due to the size of the countries' populations.
Perhaps you are right and there is no point in changing anything. Chardet reports that these files are CP866, and this is the correct answer. On the other hand, CP1125 is valid for both Russian and Ukrainian texts, while CP866 is valid for Russian text with symbols of European languages, which is a much rarer case.

Ousret commented on July 19, 2024

An alternative solution: change the probing order so that CP866 is checked before CP1125.

This will not happen. I do not want to disturb the actual order. (It is provided by "from encodings.aliases import aliases", ordered alphabetically.)
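
For reference, a minimal illustration of where that order comes from:

from encodings.aliases import aliases

# Codec names known to the standard library, deduplicated and sorted
# alphabetically: a stable, objective ordering, unlike popularity.
order = sorted(set(aliases.values()))
print(len(order), order[:8])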

I have seen many people misinterpret https://w3techs.com/technologies/overview/character_encoding in the charset detection context.
Neither Chardet nor Charset-Normalizer is an HTML charset detector. Detection concerns every kind of text document (SubRip subtitles, for example), and it is very hard to find any statistics at all on this matter. Users' usages can be very dispersed, so making assumptions is unwise.

As for the fact that the encodings are the same, this is not the case.
Chardet reports that these files are CP866, and this is the correct answer.

Maybe I expressed myself poorly. What I said is that, with the files you gave me, there is no wrong answer. And Chardet is nowhere near supporting the same number of code pages.

I believe that CP866 is more common due to the size of the countries' populations.

Yes, I believe so. But I cannot maintain an order based on popularity; I need a perpetual, objective way of doing things.

Perhaps you are right and there is no point in changing anything.

As a matter of urgency, no. But I concur that something can be improved. To me, the fact that the property could_be_from_charset outputs both CP1125 and IBM866 is sufficient.

As stated in the introduction at https://charset-normalizer.readthedocs.io/en/latest/, I strongly advise you to expect the text to be decoded correctly rather than searching for the originating encoding, which can be pointless.
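
Concretely, that advice translates to something like this (a minimal sketch; str() on a match yields the decoded text):

from charset_normalizer import from_path

best_guess = from_path("./char-dataset/ibm866/CHEBUR.TXT").best()

if best_guess is not None:
    # Work with the decoded content; which of several equivalent code
    # pages produced it is often unknowable and rarely matters.
    print(str(best_guess))                   # the decoded text
    print(best_guess.could_be_from_charset)  # e.g. ["cp1125", "ibm866"]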
