
jawah / charset_normalizer


Truly universal encoding detector in pure Python

Home Page: https://charset-normalizer.readthedocs.io/en/latest/

License: MIT License

Python 99.66% Shell 0.34%
chardet encoding charset-conversion unicode python charset-detection

charset_normalizer's Introduction

Charset Detection, for Everyone 👋

The Real First Universal Charset Detector


In other languages (unofficial ports, maintained by the community)

A library that helps you read text from an unknown charset encoding.
Motivated by chardet, I'm trying to resolve the issue by taking a new approach. All IANA character set names for which the Python core library provides codecs are supported.

>>>>> 👉 Try Me Online Now, Then Adopt Me 👈 <<<<<

This project offers you an alternative to Universal Charset Encoding Detector, also known as Chardet.

| Feature | Chardet | Charset Normalizer | cChardet |
|---|---|---|---|
| Fast | ❌ | ✅ | ✅ |
| Universal** | ❌ | ✅ | ❌ |
| Reliable without distinguishable standards | ❌ | ✅ | ✅ |
| Reliable with distinguishable standards | ✅ | ✅ | ✅ |
| License | LGPL-2.1 (restrictive) | MIT | MPL-1.1 (restrictive) |
| Native Python | ✅ | ✅ | ❌ |
| Detect spoken language | ❌ | ✅ | N/A |
| UnicodeDecodeError Safety | ❌ | ✅ | ❌ |
| Whl Size (min) | 193.6 kB | 42 kB | ~200 kB |
| Supported Encoding | 33 | 🎉 99 | 40 |


**: Chardet and cChardet rely on encoding-specific code, even though they cover most of the commonly used encodings.
Did you get here because of the logs? See https://charset-normalizer.readthedocs.io/en/latest/user/miscellaneous.html

⚡ Performance

This package offers better performance than its counterpart Chardet. Here are some numbers.

| Package | Accuracy | Mean per file (ms) | File per sec (est) |
|---|---|---|---|
| chardet | 86 % | 200 ms | 5 file/sec |
| charset-normalizer | 98 % | 10 ms | 100 file/sec |

| Package | 99th percentile | 95th percentile | 50th percentile |
|---|---|---|---|
| chardet | 1200 ms | 287 ms | 23 ms |
| charset-normalizer | 100 ms | 50 ms | 5 ms |

Chardet's performance on larger files (1 MB+) is very poor. Expect a huge difference on large payloads.

Stats are generated using 400+ files with default parameters. For details on the files used, see the GHA workflows. These results may change at any time; the dataset can be updated to include more files. The actual delays depend heavily on your CPU capabilities, but the relative factors should remain the same. Keep in mind that the stats are generous and that Chardet's accuracy is measured against its initial capability (i.e., its supported encodings). Challenge them if you want.

✨ Installation

Using pip:

pip install charset-normalizer -U

🚀 Basic Usage

CLI

This package comes with a CLI.

usage: normalizer [-h] [-v] [-a] [-n] [-m] [-r] [-f] [-t THRESHOLD]
                  file [file ...]

The Real First Universal Charset Detector. Discover originating encoding used
on text file. Normalize text to unicode.

positional arguments:
  files                 File(s) to be analysed

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         Display complementary information about file if any.
                        Stdout will contain logs about the detection process.
  -a, --with-alternative
                        Output complementary possibilities if any. Top-level
                        JSON WILL be a list.
  -n, --normalize       Permit to normalize input file. If not set, program
                        does not write anything.
  -m, --minimal         Only output the charset detected to STDOUT. Disabling
                        JSON output.
  -r, --replace         Replace file when trying to normalize it instead of
                        creating a new one.
  -f, --force           Replace file without asking if you are sure, use this
                        flag with caution.
  -t THRESHOLD, --threshold THRESHOLD
                        Define a custom maximum amount of chaos allowed in
                        decoded content. 0. <= chaos <= 1.
  --version             Show version information and exit.
normalizer ./data/sample.1.fr.srt

or

python -m charset_normalizer ./data/sample.1.fr.srt

🎉 Since version 1.4.0 the CLI produces easily usable stdout results in JSON format.

{
    "path": "/home/default/projects/charset_normalizer/data/sample.1.fr.srt",
    "encoding": "cp1252",
    "encoding_aliases": [
        "1252",
        "windows_1252"
    ],
    "alternative_encodings": [
        "cp1254",
        "cp1256",
        "cp1258",
        "iso8859_14",
        "iso8859_15",
        "iso8859_16",
        "iso8859_3",
        "iso8859_9",
        "latin_1",
        "mbcs"
    ],
    "language": "French",
    "alphabets": [
        "Basic Latin",
        "Latin-1 Supplement"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.149,
    "coherence": 97.152,
    "unicode_path": null,
    "is_preferred": true
}

Python

Just print out normalized text

from charset_normalizer import from_path

results = from_path('./my_subtitle.srt')

print(str(results.best()))
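If you need more than the decoded text, inspect the best match itself. A minimal sketch, assuming a match was found (best() returns None when nothing fits); the encoding and language fields mirror what the CLI JSON output above exposes:

from charset_normalizer import from_path

results = from_path('./my_subtitle.srt')
best_guess = results.best()

if best_guess is not None:
    print(best_guess.encoding)   # e.g. 'cp1252'
    print(best_guess.language)   # e.g. 'French'
    print(str(best_guess))       # the decoded, Unicode text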

Upgrade your code without effort

from charset_normalizer import detect

The above code will behave the same as chardet. We ensure that we offer the best (reasonable) backward-compatible result possible.
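For example, a quick sketch of the drop-in API; detect() returns a chardet-style dict with encoding, language and confidence keys (the exact values depend on your input):

from charset_normalizer import detect

payload = "Où est passé l'été ?".encode('cp1252')
result = detect(payload)

print(result['encoding'], result['confidence'])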

See the docs for advanced usage: readthedocs.io

😇 Why

When I started using Chardet, I noticed that it did not meet my expectations, and I wanted to propose a reliable alternative using a completely different method. Also, I never back down from a good challenge!

I don't care about the originating charset encoding, because two different tables can produce two identical rendered strings. What I want is to get readable text, the best I can.

In a way, I'm brute forcing text decoding. How cool is that ? 😎

Don't confuse the ftfy package with charset-normalizer or chardet. ftfy's goal is to repair Unicode strings, whereas charset-normalizer's is to convert a raw file in an unknown encoding to Unicode.

🍰 How

  • Discard all charset encoding tables that could not fit the binary content.
  • Measure the noise, or the mess, once opened (by chunks) with a corresponding charset encoding table.
  • Extract the matches with the lowest mess detected.
  • Additionally, we measure coherence / probe for a language.

Wait a minute, what is noise/mess and coherence according to YOU ?

Noise: I opened hundreds of text files, written by humans, with the wrong encoding table. I observed, then established some ground rules about what obviously looks like a mess. I know that my interpretation of what counts as noise is probably incomplete; feel free to contribute in order to improve or rewrite it.

Coherence: For every language on earth, we have computed ranked letter-frequency occurrences (as best we can). That intel is worth something here, so we use those records against the decoded text to check whether it reads like a real language.
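To make the steps above concrete, here is a deliberately naive sketch of the same "discard, then rank by mess" idea. It is only an illustration; this is not the scoring the package actually uses, and the candidate list and heuristic are arbitrary:

import unicodedata
from typing import Optional

CANDIDATES = ['utf_8', 'cp1252', 'cp1251', 'latin_1']  # arbitrary sample set

def naive_mess(text: str) -> float:
    # Fraction of characters that look suspicious (control or unassigned code points).
    if not text:
        return 0.0
    suspicious = sum(
        1 for ch in text
        if ch not in '\r\n\t' and unicodedata.category(ch) in ('Cc', 'Cn')
    )
    return suspicious / len(text)

def naive_guess(payload: bytes) -> Optional[str]:
    scores = {}
    for encoding in CANDIDATES:
        try:
            decoded = payload.decode(encoding)   # step 1: discard codecs that cannot decode
        except (UnicodeDecodeError, LookupError):
            continue
        scores[encoding] = naive_mess(decoded)   # step 2: measure the mess per candidate
    return min(scores, key=scores.get) if scores else None  # step 3: lowest mess wins

print(naive_guess("cœur déjà".encode('cp1252')))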

⚡ Known limitations

  • Language detection is unreliable when the text contains two or more languages sharing identical letters (e.g. HTML with English tags + Turkish content, both sharing Latin characters).
  • Every charset detector heavily depends on sufficient content. In common cases, do not bother running detection on very tiny content.

⚠️ About Python EOLs

If you are running:

  • Python >=2.7,<3.5: Unsupported
  • Python 3.5: charset-normalizer < 2.1
  • Python 3.6: charset-normalizer < 3.1
  • Python 3.7: charset-normalizer < 4.0

Upgrade your Python interpreter as soon as possible.

👀 Contributing

Contributions, issues and feature requests are very much welcome.
Feel free to check the issues page if you want to contribute.

πŸ“ License

Copyright © Ahmed TAHRI @Ousret.
This project is MIT licensed.

Character frequencies used in this project © 2012 Denny Vrandečić

💼 For Enterprise

Professional support for charset-normalizer is available as part of the Tidelift Subscription. Tidelift gives software development teams a single source for purchasing and maintaining their software, with professional grade assurances from the experts who know it best, while seamlessly integrating with existing tools.

charset_normalizer's People

Contributors

adbar, akx, aleksandernovikov, blkserene, deedy5, dependabot[bot], fantasquex, frenzymadness, hugovk, jayvdb, kianmeng, nijel, nmaynes, oleksandr-kuzmenko, ousret, pinterior, pythoncoderas, step-security-bot


charset_normalizer's Issues

[DETECTION] no encoding found, contrarily to chardet and cchardet

Notice
I hereby announce that my raw input is not :

  • Too small content (<=32 characters) as I do know that ANY charset detector heavily depends on content
  • Encoded in a deprecated/abandoned encoding that is not even supported by my interpreter

Provide the file
An accessible way of retrieving the file concerned. Host it somewhere with the encoding untouched.

Verbose output

2021-09-17 13:08:23,491 | INFO | Detected declarative mark in sequence. Priority +1 given for utf_8.
2021-09-17 13:08:23,491 | WARNING | Code page utf_8 does not fit given bytes sequence at ALL. 'utf-8' codec can't decode byte 0xe4 in position 2531: invalid continuation byte
2021-09-17 13:08:23,492 | WARNING | Code page ascii does not fit given bytes sequence at ALL. 'ascii' codec can't decode byte 0xe4 in position 2531: ordinal not in range(128)
2021-09-17 13:08:23,493 | WARNING | Code page big5 does not fit given bytes sequence at ALL. 'big5' codec can't decode byte 0xfc in position 27411: illegal multibyte sequence
2021-09-17 13:08:23,494 | WARNING | Code page big5hkscs does not fit given bytes sequence at ALL. 'big5hkscs' codec can't decode byte 0xa0 in position 161824: illegal multibyte sequence
2021-09-17 13:08:23,496 | WARNING | cp037 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 526.200000 %.
2021-09-17 13:08:23,496 | WARNING | cp1026 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,508 | WARNING | cp1125 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,509 | WARNING | cp1140 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,509 | WARNING | cp1250 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,510 | WARNING | cp1251 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,511 | WARNING | cp1252 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,512 | WARNING | cp1253 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,513 | WARNING | cp1254 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,513 | WARNING | Code page cp1255 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xfc in position 27411: character maps to <undefined>
2021-09-17 13:08:23,514 | WARNING | cp1256 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,515 | WARNING | cp1257 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,515 | WARNING | cp1258 is deemed too similar to code page cp1252 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,516 | WARNING | cp273 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,517 | WARNING | Code page cp424 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x70 in position 31: character maps to <undefined>
2021-09-17 13:08:23,517 | WARNING | cp437 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,518 | WARNING | cp500 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,519 | WARNING | cp775 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,520 | WARNING | cp850 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,521 | WARNING | cp852 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,522 | WARNING | cp855 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,523 | WARNING | cp857 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,524 | WARNING | cp858 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,525 | WARNING | cp860 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,525 | WARNING | cp861 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,526 | WARNING | cp862 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,526 | WARNING | cp863 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,527 | WARNING | cp864 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,528 | WARNING | cp865 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,528 | WARNING | cp866 is deemed too similar to code page cp1125 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,529 | WARNING | Code page cp869 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x84 in position 187587: character maps to <undefined>
2021-09-17 13:08:23,530 | WARNING | Code page cp932 does not fit given bytes sequence at ALL. 'cp932' codec can't decode byte 0xfc in position 27411: illegal multibyte sequence
2021-09-17 13:08:23,530 | WARNING | Code page cp949 does not fit given bytes sequence at ALL. 'cp949' codec can't decode byte 0xe4 in position 2531: illegal multibyte sequence
2021-09-17 13:08:23,531 | WARNING | Code page cp950 does not fit given bytes sequence at ALL. 'cp950' codec can't decode byte 0xfc in position 27411: illegal multibyte sequence
2021-09-17 13:08:23,531 | WARNING | Code page euc_jis_2004 does not fit given bytes sequence at ALL. 'euc_jis_2004' codec can't decode byte 0xe4 in position 2531: illegal multibyte sequence
2021-09-17 13:08:23,531 | WARNING | Code page euc_jisx0213 does not fit given bytes sequence at ALL. 'euc_jisx0213' codec can't decode byte 0xe4 in position 2531: illegal multibyte sequence
2021-09-17 13:08:23,532 | WARNING | Code page euc_jp does not fit given bytes sequence at ALL. 'euc_jp' codec can't decode byte 0xe4 in position 2531: illegal multibyte sequence
2021-09-17 13:08:23,532 | WARNING | Code page euc_kr does not fit given bytes sequence at ALL. 'euc_kr' codec can't decode byte 0xe4 in position 2531: illegal multibyte sequence
2021-09-17 13:08:23,533 | WARNING | Code page gb18030 does not fit given bytes sequence at ALL. 'gb18030' codec can't decode byte 0xa0 in position 161824: illegal multibyte sequence
2021-09-17 13:08:23,533 | WARNING | Code page gb2312 does not fit given bytes sequence at ALL. 'gb2312' codec can't decode byte 0xe4 in position 2531: illegal multibyte sequence
2021-09-17 13:08:23,534 | WARNING | Code page gbk does not fit given bytes sequence at ALL. 'gbk' codec can't decode byte 0xa0 in position 161824: illegal multibyte sequence
2021-09-17 13:08:23,535 | WARNING | hp_roman8 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,535 | WARNING | Code page hz does not fit given bytes sequence at ALL. 'hz' codec can't decode byte 0xe4 in position 2531: illegal multibyte sequence
2021-09-17 13:08:23,536 | WARNING | Code page iso2022_jp does not fit given bytes sequence at ALL. 'iso2022_jp' codec can't decode byte 0xe4 in position 2531: illegal multibyte sequence
2021-09-17 13:08:23,536 | WARNING | Code page iso2022_jp_1 does not fit given bytes sequence at ALL. 'iso2022_jp_1' codec can't decode byte 0xe4 in position 2531: illegal multibyte sequence
2021-09-17 13:08:23,536 | WARNING | Code page iso2022_jp_2 does not fit given bytes sequence at ALL. 'iso2022_jp_2' codec can't decode byte 0xe4 in position 2531: illegal multibyte sequence
2021-09-17 13:08:23,537 | WARNING | Code page iso2022_jp_2004 does not fit given bytes sequence at ALL. 'iso2022_jp_2004' codec can't decode byte 0xe4 in position 2531: illegal multibyte sequence
2021-09-17 13:08:23,537 | WARNING | Code page iso2022_jp_3 does not fit given bytes sequence at ALL. 'iso2022_jp_3' codec can't decode byte 0xe4 in position 2531: illegal multibyte sequence
2021-09-17 13:08:23,537 | WARNING | Code page iso2022_jp_ext does not fit given bytes sequence at ALL. 'iso2022_jp_ext' codec can't decode byte 0xe4 in position 2531: illegal multibyte sequence
2021-09-17 13:08:23,538 | WARNING | Code page iso2022_kr does not fit given bytes sequence at ALL. 'iso2022_kr' codec can't decode byte 0xe4 in position 2531: illegal multibyte sequence
2021-09-17 13:08:23,538 | WARNING | iso8859_10 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,539 | WARNING | Code page iso8859_11 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xfc in position 27411: character maps to <undefined>
2021-09-17 13:08:23,539 | WARNING | iso8859_13 is deemed too similar to code page cp1257 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,540 | WARNING | iso8859_14 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,541 | WARNING | iso8859_15 is deemed too similar to code page cp1252 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,541 | WARNING | iso8859_16 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,542 | WARNING | iso8859_2 is deemed too similar to code page cp1250 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,543 | WARNING | iso8859_3 is deemed too similar to code page iso8859_16 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,543 | WARNING | iso8859_4 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,544 | WARNING | iso8859_5 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,544 | WARNING | Code page iso8859_6 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xfc in position 27411: character maps to <undefined>
2021-09-17 13:08:23,545 | WARNING | iso8859_7 is deemed too similar to code page cp1253 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,545 | WARNING | Code page iso8859_8 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xd6 in position 22977: character maps to <undefined>
2021-09-17 13:08:23,546 | WARNING | iso8859_9 is deemed too similar to code page cp1252 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,546 | WARNING | Code page johab does not fit given bytes sequence at ALL. 'johab' codec can't decode byte 0xd6 in position 22977: illegal multibyte sequence
2021-09-17 13:08:23,547 | WARNING | koi8_r was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,547 | WARNING | kz1048 is deemed too similar to code page cp1251 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,547 | WARNING | latin_1 is deemed too similar to code page cp1252 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,548 | WARNING | mac_cyrillic was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,549 | WARNING | mac_greek was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,549 | WARNING | mac_iceland was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,550 | WARNING | mac_latin2 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.967000 %.
2021-09-17 13:08:23,551 | WARNING | mac_roman is deemed too similar to code page mac_iceland and was consider unsuited already. Continuing!
2021-09-17 13:08:23,551 | WARNING | mac_turkish is deemed too similar to code page mac_iceland and was consider unsuited already. Continuing!
2021-09-17 13:08:23,552 | WARNING | ptcp154 is deemed too similar to code page cp1251 and was consider unsuited already. Continuing!
2021-09-17 13:08:23,552 | WARNING | Code page shift_jis does not fit given bytes sequence at ALL. 'shift_jis' codec can't decode byte 0xfc in position 27411: illegal multibyte sequence
2021-09-17 13:08:23,554 | WARNING | Code page shift_jis_2004 does not fit given bytes sequence at ALL. 'shift_jis_2004' codec can't decode byte 0xa0 in position 161824: illegal multibyte sequence
2021-09-17 13:08:23,555 | WARNING | Code page shift_jisx0213 does not fit given bytes sequence at ALL. 'shift_jisx0213' codec can't decode byte 0xa0 in position 161824: illegal multibyte sequence
2021-09-17 13:08:23,556 | WARNING | Code page tis_620 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xfc in position 27411: character maps to <undefined>
2021-09-17 13:08:23,556 | INFO | Encoding utf_16 wont be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.
2021-09-17 13:08:23,556 | WARNING | Code page utf_16_be does not fit given bytes sequence at ALL. 'utf-16-be' codec can't decode bytes in position 161318-161319: illegal encoding
2021-09-17 13:08:23,556 | WARNING | Code page utf_16_le does not fit given bytes sequence at ALL. 'utf-16-le' codec can't decode bytes in position 161560-161561: illegal encoding
2021-09-17 13:08:23,556 | INFO | Encoding utf_32 wont be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.
2021-09-17 13:08:23,557 | WARNING | Code page utf_32_be does not fit given bytes sequence at ALL. 'utf-32-be' codec can't decode bytes in position 0-3: code point not in range(0x110000)
2021-09-17 13:08:23,557 | WARNING | Code page utf_32_le does not fit given bytes sequence at ALL. 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)
2021-09-17 13:08:23,557 | WARNING | Code page utf_7 does not fit given bytes sequence at ALL. 'utf7' codec can't decode byte 0xe4 in position 2531: unexpected special character
Unable to identify originating encoding for "anzeige-value-stars-mit-ausgewaehlten-aktien-den-dax-schlagen-5873873". Maybe try increasing maximum amount of chaos.
{
    "path": "/home/adbar/anzeige-value-stars-mit-ausgewaehlten-aktien-den-dax-schlagen-5873873",
    "encoding": null,
    "encoding_aliases": [],
    "alternative_encodings": [],
    "language": "Unknown",
    "alphabets": [],
    "has_sig_or_bom": false,
    "chaos": 1.0,
    "coherence": 0.0,
    "unicode_path": null,
    "is_preferred": true
}

Expected encoding

chardet and cchardet both agree on windows-1252 but I'm not certain.

Desktop (please complete the following information):

  • OS: Linux
  • Python version 3.6.9
  • Package version 2.0.5

Additional context

Your package looks nice! I'm currently testing it with edge cases, i.e. HTML documents with strange or inconsistent encodings.

The issue is also referenced here: adbar/trafilatura#79

[BUG] 2.0.11: pytest is failing

Describe the bug
Looks like pytest is failing in two units.

To Reproduce
I'm trying to package your module as an rpm package, so I'm using the typical PEP 517-based build, install, and test cycle used when building packages from a non-root account.

  • python3 -sBm build -w --no-isolation
  • because I'm calling build with --no-isolation I'm using during all processes only locally installed modules
  • install .whl file in </install/prefix>
  • run pytest with PYTHONPATH pointing to sitearch and sitelib inside </install/prefix>

Expected behavior
pytest should pass without errors/failures.

Logs
Here is pytest output:

+ PYTHONPATH=/home/tkloczko/rpmbuild/BUILDROOT/python-charset-normalizer-2.0.11-2.fc35.x86_64/usr/lib64/python3.8/site-packages:/home/tkloczko/rpmbuild/BUILDROOT/python-charset-normalizer-2.0.11-2.fc35.x86_64/usr/lib/python3.8/site-packages
+ /usr/bin/pytest -ra
=========================================================================== test session starts ============================================================================
platform linux -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
rootdir: /home/tkloczko/rpmbuild/BUILD/charset_normalizer-2.0.11, configfile: setup.cfg
plugins: flaky-3.7.0, forked-1.4.0, shutil-1.7.0, cov-3.0.0, virtualenv-1.7.0, flake8-1.0.7, xdist-2.5.0, checkdocs-2.7.1, pytest_check-1.0.4
collected 128 items

. .                                                                                                                                                                  [  0%]
tests/test_base_detection.py ..........................                                                                                                              [ 21%]
tests/test_cli.py ............                                                                                                                                       [ 30%]
tests/test_coherence_detection.py ...............                                                                                                                    [ 42%]
tests/test_detect_legacy.py ....                                                                                                                                     [ 45%]
tests/test_edge_case.py .                                                                                                                                            [ 46%]
tests/test_full_detection.py .................                                                                                                                       [ 59%]
tests/test_large_payload.py ...                                                                                                                                      [ 62%]
tests/test_logging.py FF..                                                                                                                                           [ 65%]
tests/test_mess_detection.py ..........                                                                                                                              [ 73%]
tests/test_normalize_fp.py .                                                                                                                                         [ 74%]
tests/test_preemptive_detection.py ..........                                                                                                                        [ 81%]
tests/test_utils.py .......................                                                                                                                          [100%]

================================================================================= FAILURES =================================================================================
_____________________________________________________________ TestLogBehaviorClass.test_explain_true_behavior ______________________________________________________________

self = <tests.test_logging.TestLogBehaviorClass object at 0x7fac862858b0>, caplog = <_pytest.logging.LogCaptureFixture object at 0x7fac86285d00>

    def test_explain_true_behavior(self, caplog):
        test_sequence = b'This is a test sequence of bytes that should be sufficient'
        from_bytes(test_sequence, steps=1, chunk_size=50, explain=True)
        assert explain_handler not in self.logger.handlers
        for record in caplog.records:
>           assert record.levelname in ["Level 5", "DEBUG"]
E           assert 'VERBOSE' in ['Level 5', 'DEBUG']
E            +  where 'VERBOSE' = <LogRecord: charset_normalizer, 5, /home/tkloczko/rpmbuild/BUILD/charset_normalizer-2.0.11/charset_normalizer/api.py, 394, "%s passed initial chaos probing. Mean measured chaos is %f %%">.levelname

tests/test_logging.py:21: AssertionError
--------------------------------------------------------------------------- Captured stderr call ---------------------------------------------------------------------------
2022-02-03 11:54:52,705 | VERBOSE | ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
2022-02-03 11:54:52,705 | VERBOSE | ascii should target any language(s) of ['Latin Based']
2022-02-03 11:54:52,706 | DEBUG | Encoding detection: ascii is most likely the one.
---------------------------------------------------------------------------- Captured log call -----------------------------------------------------------------------------
VERBOSE  charset_normalizer:api.py:394 ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
VERBOSE  charset_normalizer:api.py:407 ascii should target any language(s) of ['Latin Based']
DEBUG    charset_normalizer:api.py:451 Encoding detection: ascii is most likely the one.
_______________________________________________________ TestLogBehaviorClass.test_explain_false_handler_set_behavior _______________________________________________________

self = <tests.test_logging.TestLogBehaviorClass object at 0x7fac86275b20>, caplog = <_pytest.logging.LogCaptureFixture object at 0x7fac862756a0>

    def test_explain_false_handler_set_behavior(self, caplog):
        test_sequence = b'This is a test sequence of bytes that should be sufficient'
        set_logging_handler(level=TRACE, format_string="%(message)s")
        from_bytes(test_sequence, steps=1, chunk_size=50, explain=False)
        assert any(isinstance(hdl, logging.StreamHandler) for hdl in self.logger.handlers)
        for record in caplog.records:
>           assert record.levelname in ["Level 5", "DEBUG"]
E           assert 'VERBOSE' in ['Level 5', 'DEBUG']
E            +  where 'VERBOSE' = <LogRecord: charset_normalizer, 5, /home/tkloczko/rpmbuild/BUILD/charset_normalizer-2.0.11/charset_normalizer/api.py, 394, "%s passed initial chaos probing. Mean measured chaos is %f %%">.levelname

tests/test_logging.py:29: AssertionError
--------------------------------------------------------------------------- Captured stderr call ---------------------------------------------------------------------------
ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
ascii should target any language(s) of ['Latin Based']
Encoding detection: ascii is most likely the one.
---------------------------------------------------------------------------- Captured log call -----------------------------------------------------------------------------
VERBOSE  charset_normalizer:api.py:394 ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
VERBOSE  charset_normalizer:api.py:407 ascii should target any language(s) of ['Latin Based']
DEBUG    charset_normalizer:api.py:451 Encoding detection: ascii is most likely the one.

---------- coverage: platform linux, python 3.8.12-final-0 -----------
Name                                    Stmts   Miss  Cover   Missing
---------------------------------------------------------------------
charset_normalizer/__init__.py              9      0   100%
charset_normalizer/api.py                 222     29    87%   66, 82-83, 98-104, 120, 147-148, 186, 209-215, 302-311, 327, 387, 389, 466-467, 478-482, 494-496, 505, 593
charset_normalizer/assets/__init__.py       2      0   100%
charset_normalizer/cd.py                  163      6    96%   25, 40, 175, 213-214, 241
charset_normalizer/cli/__init__.py          0      0   100%
charset_normalizer/cli/normalizer.py       91     27    70%   27, 30-33, 39, 43, 139-140, 151-160, 197-199, 222-230, 238-250, 257-261, 290
charset_normalizer/constant.py             23      0   100%
charset_normalizer/legacy.py               31      1    97%   26
charset_normalizer/md.py                  273     11    96%   96, 156, 240, 357-358, 444, 456, 465, 498, 502, 512
charset_normalizer/models.py              195     27    86%   42, 54, 63, 70, 79-83, 91-95, 103-109, 114, 118, 122, 143, 162, 174, 219, 223, 250, 287, 295, 301, 315, 334, 341, 392
charset_normalizer/utils.py               187     28    85%   30-31, 46, 71-72, 109, 140-142, 159-160, 169-170, 179-180, 189-190, 205, 279-282, 286-296, 302
charset_normalizer/version.py               2      0   100%
---------------------------------------------------------------------
TOTAL                                    1198    129    89%

========================================================================= short test summary info ==========================================================================
FAILED tests/test_logging.py::TestLogBehaviorClass::test_explain_true_behavior - assert 'VERBOSE' in ['Level 5', 'DEBUG']
FAILED tests/test_logging.py::TestLogBehaviorClass::test_explain_false_handler_set_behavior - assert 'VERBOSE' in ['Level 5', 'DEBUG']
====================================================================== 2 failed, 125 passed in 15.73s ======================================================================

Desktop (please complete the following information):

  • OS: Linux x86/64
  • Python version 3.8.12
  • Package version 2.0.11

Additional context
N/A

charset_normalizer logging behavior

Hi @Ousret,

This is a bit of a continuation of #145. I wanted to start a discussion on the current logging levels and why they were chosen to better understand the use case/design decision. Most of that wasn't covered in the previous issue. I'd originally read this as being a DEBUG level log but realized I was mistaken, as it's INFO.

What do you envision as the common use case for logging these messages at INFO (there are more, but we'll start here) [1][2][3][4]? What would the user be expected to do with the information provided? They seem more like a stream of consciousness on what charset_normalizer's hot path is doing, rather than noting novel events. I'd personally not expect this to be relevant for general library usage, and it probably becomes even less relevant for libraries integrating with the project.

Currently, that would result in somewhere around 3 MB of logs per hour at 1 TPS, which scales out to a couple of gigabytes a month. While that's not huge, it's not trivial either. If you scale that up to hundreds of TPS, we start recording closer to 250-500 GB/mo. That's a lot of IO and potential disk space for long-lived logs.
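For reference, a rough back-of-envelope check of the volumes quoted above (my own arithmetic, assuming ~3 MB of logs per hour at 1 TPS, a 30-day month, and linear scaling with TPS):

mb_per_hour_at_1_tps = 3
hours_per_month = 24 * 30

gb_per_month = mb_per_hour_at_1_tps * hours_per_month / 1024
print(f'~{gb_per_month:.1f} GB/month at 1 TPS')          # ~2.1 GB/month
print(f'~{gb_per_month * 100:.0f} GB/month at 100 TPS')  # ~211 GB/month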

2.0.8 way too verbose

>>> import requests
>>> requests.__version__
'2.26.0'
>>> import charset_normalizer
>>> charset_normalizer.__version__
'2.0.8'

windows10, ubuntu 20.04 lts

I need to set this in all files just to get rid of it:

logging.getLogger('charset_normalizer').setLevel(logging.FATAL)

The first and the last are mine; the ones in the middle are just hyper verbose, definitely not info, not even debug.

20211125112541|INFO|operator users:{'username': '15918697710', 'phone': '+4799113717', 'personstatus': 'bosatt', 'fornavn': 'VIS', 'mellomnavn': None, 'etternavn': 'FORDYPNING'}

20211125112541|WARNING|override steps (5) and chunk_size (512) as content does not fit (465 byte(s) given) parameters.
20211125112541|INFO|ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
20211125112541|INFO|ascii should target any language(s) of ['Latin Based']
20211125112541|INFO|We detected language [('English', 1.0), ('Dutch', 1.0), ('Indonesian', 1.0)] using ascii
20211125112541|INFO|ascii is most likely the one. Stopping the process.
20211125112541|WARNING|override steps (5) and chunk_size (512) as content does not fit (465 byte(s) given) parameters.
20211125112541|INFO|ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
20211125112541|INFO|ascii should target any language(s) of ['Latin Based']
20211125112541|INFO|We detected language [('English', 1.0), ('Dutch', 1.0), ('Indonesian', 1.0)] using ascii
20211125112541|INFO|ascii is most likely the one. Stopping the process.

20211125112541|INFO|added mobile: {'result': 0, 'description': 'success', 'dn': '', 'message': '', 'referrals': None, 'type': 'modifyResponse'}

[Proposal] Unicode language space detection

Is your feature request related to a problem? Please describe.
Yes, I am trying to create a filter to detect whether or not a string CAN be part of a specific language, e.g. Italian should not have Greek or Cyrillic characters, or even Latin characters with diacritics.

Describe the solution you'd like
A regex-based or codepoint-based system that shows which alphabets each language is expected to have (or not).

Describe alternatives you've considered
Creating my own regex

Additional context
It is used to filter image tags (information strings consisting of one or a few words) by language, so that I can automatically detect tags that don't belong to a given language.

[BUG] Use importlib.resources or encode assets into Python

Describe the bug

The assets/ directory shouldn't be accessed through the filesystem directly; instead it should use importlib.resources or load the data into Python, similar to how the idna package does.

This is to make charset_normalizer work in situations where there isn't a filesystem available like when being run from a zip file/static binary.
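A minimal sketch of the suggested approach, assuming the data remains a frequencies.json resource inside a charset_normalizer.assets package (the resource name is taken from the PyInstaller warning quoted in a later issue, not from the actual code):

import json
from importlib.resources import files  # Python 3.9+

def load_frequencies() -> dict:
    # Read packaged data without building filesystem paths by hand, so it also
    # works when running from a zip file or a frozen binary.
    resource = files('charset_normalizer.assets').joinpath('frequencies.json')
    return json.loads(resource.read_text(encoding='utf-8'))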

[BUG] 2.0.4: pytest warnings

Describe the bug
pytest shows some warnings on testing

To Reproduce
I'm trying to package your module as an rpm package, so I'm using the typical build, install, and test cycle used when building packages from a non-root account:

  • "setup.py build"
  • "setup.py install --root </install/prefix>"
  • "pytest with PYTHONPATH pointing to sitearch and sitelib inside </install/prefix>

Expected behavior
No errors or warnings printed by pytest.

Logs
May I ask for help, because a few units are failing:

+ PYTHONPATH=/home/tkloczko/rpmbuild/BUILDROOT/python-charset-normalizer-2.0.4-2.fc35.x86_64/usr/lib64/python3.8/site-packages:/home/tkloczko/rpmbuild/BUILDROOT/python-charset-normalizer-2.0.4-2.fc35.x86_64/usr/lib/python3.8/site-packages
+ /usr/bin/pytest -ra
=========================================================================== test session starts ============================================================================
platform linux -- Python 3.8.11, pytest-6.2.4, py-1.10.0, pluggy-0.13.1
benchmark: 3.4.1 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
Using --randomly-seed=1562910644
rootdir: /home/tkloczko/rpmbuild/BUILD/charset_normalizer-2.0.4, configfile: setup.cfg
plugins: forked-1.3.0, shutil-1.7.0, virtualenv-1.7.0, expect-1.1.0, flake8-1.0.7, timeout-1.4.2, betamax-0.8.1, freezegun-0.4.2, aspectlib-1.5.2, toolbox-0.5, rerunfailures-9.1.1, requests-mock-1.9.3, cov-2.12.1, pyfakefs-4.5.0, flaky-3.7.0, benchmark-3.4.1, xdist-2.3.0, pylama-7.7.1, datadir-1.3.1, regressions-2.2.0, cases-3.6.3, xprocess-0.18.1, black-0.3.12, anyio-3.3.0, Faker-8.11.0, asyncio-0.15.1, trio-0.7.0, httpbin-1.0.0, subtests-0.5.0, isort-2.0.0, hypothesis-6.14.6, mock-3.6.1, profiling-1.7.0, randomly-3.8.0
collected 34 items

tests/test_inherent_sign.py ...                                                                                                                                      [  8%]
tests/test_language.py .                                                                                                                                             [ 11%]
tests/test_on_byte.py ............                                                                                                                                   [ 47%]
tests/test_probe_coherence.py .                                                                                                                                      [ 50%]
tests/test_unicode_helper.py .                                                                                                                                       [ 52%]
tests/test_detect_legacy.py ....                                                                                                                                     [ 64%]
tests/test_on_file.py .                                                                                                                                              [ 67%]
tests/test_probe_chaos.py ....                                                                                                                                       [ 79%]
tests/test_cli.py .......                                                                                                                                            [100%]

============================================================================= warnings summary =============================================================================
tests/test_on_byte.py::TestBytes::test_on_empty_json
tests/test_on_byte.py::TestBytes::test_too_short_none
  /home/tkloczko/rpmbuild/BUILD/charset_normalizer-2.0.4/charset_normalizer/api.py:105: UserWarning: Trying to detect encoding from a tiny portion of (2) byte(s).
    warn('Trying to detect encoding from a tiny portion of ({}) byte(s).'.format(length))

tests/test_on_byte.py::TestBytes::test_encode_decode
tests/test_detect_legacy.py::TestDetectLegacy::test_utf8_sig_not_striped
  /home/tkloczko/rpmbuild/BUILD/charset_normalizer-2.0.4/charset_normalizer/api.py:105: UserWarning: Trying to detect encoding from a tiny portion of (14) byte(s).
    warn('Trying to detect encoding from a tiny portion of ({}) byte(s).'.format(length))

tests/test_on_byte.py::TestBytes::test_alphabets_property_undefined_range
  /home/tkloczko/rpmbuild/BUILD/charset_normalizer-2.0.4/charset_normalizer/api.py:105: UserWarning: Trying to detect encoding from a tiny portion of (7) byte(s).
    warn('Trying to detect encoding from a tiny portion of ({}) byte(s).'.format(length))

tests/test_on_byte.py::TestBytes::test_empty_str_with_sig_gb18030
  /home/tkloczko/rpmbuild/BUILD/charset_normalizer-2.0.4/charset_normalizer/api.py:105: UserWarning: Trying to detect encoding from a tiny portion of (4) byte(s).
    warn('Trying to detect encoding from a tiny portion of ({}) byte(s).'.format(length))

tests/test_on_byte.py::TestBytes::test_ensure_u8_fallback
tests/test_on_byte.py::TestBytes::test_empty_str_with_sig_utf8
  /home/tkloczko/rpmbuild/BUILD/charset_normalizer-2.0.4/charset_normalizer/api.py:105: UserWarning: Trying to detect encoding from a tiny portion of (3) byte(s).
    warn('Trying to detect encoding from a tiny portion of ({}) byte(s).'.format(length))

-- Docs: https://docs.pytest.org/en/stable/warnings.html
====================================================================== 34 passed, 8 warnings in 3.72s ======================================================================
pytest-xprocess reminder::Be sure to terminate the started process by running 'pytest --xkill' if you have not explicitly done so in your fixture with 'xprocess.getinfo(<process_name>).terminate()'.

Desktop (please complete the following information):

  • OS: Linux/x86_64
  • Python version 3.8.11
  • Package version: 2.0.4
  • pytest 6.2.4

Additional context
None. Simply, I'm not sure whether it's something important or not, and as always it is better to ask :)

[BUG] utf-8 misdetected as cp1256

Describe the bug
The file is detected as cp1256 while it is actually utf-8.

To Reproduce
file.txt (the file is anonymized for privacy reasons)

Expected behavior
utf-8 should be detected.

Logs

$ normalizer /tmp/file.txt 
{
    "path": "/tmp/file.txt",
    "encoding": "cp1256",
    "encoding_aliases": [
        "1256",
        "windows_1256"
    ],
    "alternative_encodings": [],
    "language": "Farsi",
    "alphabets": [
        "Arabic",
        "Basic Latin",
        "Control character",
        "General Punctuation",
        "Latin Extended-A",
        "Latin Extended-B",
        "Latin-1 Supplement",
        "Letterlike Symbols"
    ],
    "has_sig_or_bom": false,
    "chaos": 2.32,
    "coherence": 0.0,
    "unicode_path": null,
    "is_preferred": true
}

Desktop (please complete the following information):

  • OS: Linux
  • Python version 3.9.2
  • Package version 2.0.12

Additional context
chardet works fine on this file:

$ chardet /tmp/file.txt 
/tmp/file.txt: utf-8 with confidence 0.99

Receive userWarning: Charset-Normalizer require 'C:\Users\juergens\AppData\Local\Temp\_MEI143042\charset_normalizer\assets/frequencies.json'

Describe the bug
A package has a dependency on charset-normalizer 2.0.4.
We use PyInstaller to package our application into an executable (EXE).

Executing the EXE brings up this warning:
charset_normalizer\assets\__init__.py:17: UserWarning: Charset-Normalizer require 'C:\Users\juergens\AppData\Local\Temp\_MEI143042\charset_normalizer\assets/frequencies.json' to be existent for language/coherence detection. Detection WILL be weaker.

Expected behavior
No warning.

Desktop (please complete the following information):

  • OS: Windows 10
  • Python version 2.9.1
  • Package version 2.0.4

Additional context
The requested JSON frequencies.json is not found in the Conda package of v2.0.4, nor in the GitHub repo of charset-normalizer.
The folder charset_normalizer\assets only includes the __init__.py file.

The folder Temp\_MEI143042 is created by PyInstaller to unpack the EXE into a temp folder before execution.

Copyright status of test data

Can you clarify the copyright status of the test data files in data? There are some things there that look like commercial TV subtitles. I wouldn't assume that you can legally redistribute those.

This came up when reviewing the package contents in Debian.

Ideally, anything that is public domain or freely licensed should be documented with the copyright holder and license, and everything else should be removed from the package. That's a fairly tiresome bureaucratic job, but it needs to be done for us to be able to distribute the test content and run the test suite.

[BUG] Wrong encoding detected for empty JSON response

Describe the bug

requests 2.26.0 switched to using charset_normalizer by default under Python 3 (see: googleapis/python-cloud-core#117). After this change, a response constructed with an empty JSON body (b"{}") can no longer be unmarshalled to the empty dict in its json method.

To Reproduce

>>> import charset_normalizer
>>> empty_json_response = b"{}"
>>> detected = charset_normalizer.detect(empty_json_response)
....nox/unit-3-6/lib/python3.6/site-packages/charset_normalizer/api.py:95: UserWarning: Trying to detect encoding from a tiny portion of (2) byte(s).
  warn('Trying to detect encoding from a tiny portion of ({}) byte(s).'.format(length))
>>> detected
{'encoding': 'utf_16_be', 'language': '', 'confidence': 1.0}
>>> decoded = empty_json_response.decode(detected["encoding"])
>>> decoded
'筽'
>>> import json
>>> json.loads(decoded)
/opt/Python-3.6.10/lib/python3.6/json/__init__.py:354: in loads
    return _default_decoder.decode(s)
/opt/Python-3.6.10/lib/python3.6/json/decoder.py:339: in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <json.decoder.JSONDecoder object at 0x7fd0b5810860>, s = '筽', idx = 0

    def raw_decode(self, s, idx=0):
        """Decode a JSON document from ``s`` (a ``str`` beginning with
        a JSON document) and return a 2-tuple of the Python
        representation and the index in ``s`` where the document ended.
    
        This can be used to decode a JSON document from a string that may
        have extraneous data at the end.
    
        """
        try:
            obj, end = self.scan_once(s, idx)
        except StopIteration as err:
>           raise JSONDecodeError("Expecting value", s, err.value) from None
E           json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Expected behavior
I expect to be able to unmarshal b"{}" into an empty dict, {}.
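A caller-side workaround sketch (my assumption, not library behavior): skip detection for very tiny payloads, which the project itself documents as unreliable, and fall back to UTF-8:

from charset_normalizer import detect

def safe_decode(payload: bytes) -> str:
    if len(payload) < 32:  # threshold echoes the "too small content" note used elsewhere
        return payload.decode('utf-8', errors='replace')
    guess = detect(payload)
    return payload.decode(guess['encoding'] or 'utf-8', errors='replace')

print(safe_decode(b'{}'))  # '{}'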

Desktop (please complete the following information):

  • OS: Linux
  • Python version: 3.6, 3.7, 3.8, 3.9
  • Package version: 2.0.1

Uncaught Python exception: FileNotFoundError

Hi,
I am a high school student just starting to learn about testing techniques. I found a bug in the latest version (2.0.12) of charset_normalizer while doing some fuzzing using the fuzzing tool Atheris. The reproducing process is shown below:

from charset_normalizer import normalize
normalize("")

This raises a FileNotFoundError with the following traceback:

Details
Traceback (most recent call last):
  File "fuzzer_charset-normalizer.py", line 9, in TestOneInput
    normalize(data)
  File "/home/clou5/.local/lib/python3.8/site-packages/charset_normalizer/api.py", line 579, in normalize
    results = from_path(
  File "/home/clou5/.local/lib/python3.8/site-packages/charset_normalizer/api.py", line 554, in from_path
    with open(path, "rb") as fp:
FileNotFoundError: [Errno 2] No such file or directory: b''

My environment

Python 3.8.10 (default, Nov 26 2021, 20:14:08) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import charset_normalizer
>>> print(charset_normalizer.__version__)
2.0.12
>>> 

UTF-8 file detects as 'ascii' is this normal? [DETECTION]

Notice
I hereby announce that my raw input is not :

  • Too small content (<=32 characters) as I do know that ANY charset detector heavily depends on content
  • Encoded in a deprecated/abandoned encoding that is not even supported by my interpreter

Provide the file
An accessible way of retrieving the file concerned. Host it somewhere with the encoding untouched.
https://drive.google.com/file/d/1-qE5HG1AOKGl8-4R2eQQvhdvTN52mm9-/view?usp=sharing

Verbose output
Using the CLI, run normalizer -v ./my-file.txt and paste the result here.

PS C:\CMDLINE\p> normalizer -v 'R:\08-05-22(ANSI).srt'
2022-06-24 20:57:59,218 | Level 5 | ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
2022-06-24 20:57:59,218 | Level 5 | ascii should target any language(s) of ['Latin Based']
2022-06-24 20:57:59,218 | DEBUG | Encoding detection: ascii is most likely the one.
{
    "path": "R:\\08-05-22(ANSI).srt",
    "encoding": "ascii",
    "encoding_aliases": [
        "646",
        "ansi_x3.4_1968",
        "ansi_x3_4_1968",
        "ansi_x3.4_1986",
        "cp367",
        "csascii",
        "ibm367",
        "iso646_us",
        "iso_646.irv_1991",
        "iso_ir_6",
        "us",
        "us_ascii"
    ],
    "alternative_encodings": [],
    "language": "English",
    "alphabets": [
        "Basic Latin",
        "Control character"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.0,
    "coherence": 0.0,
    "unicode_path": null,
    "is_preferred": true
}

******* Output from my script *******
Input file  :
 R:\08-05-22(ANSI).srt
Output file :
 R:\08-05-22(ANSI).txt

Detected Character Encoding: ascii
Confidence of encoding     : 100.00%
Output will use input encoding
PS C:\CMDLINE\p> normalizer -v 'R:\08-05-22(BOM-SIG).srt'
2022-06-24 20:59:23,682 | Level 5 | Detected a SIG or BOM mark on first 3 byte(s). Priority +1 given for utf_8.
2022-06-24 20:59:23,682 | Level 5 | Code page utf_8 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2022-06-24 20:59:23,689 | Level 5 | utf_8 passed initial chaos probing. Mean measured chaos is 0.000000 %
2022-06-24 20:59:23,693 | Level 5 | We detected language [('English', 1.0), ('Indonesian', 1.0), ('Simple English', 1.0), ('Dutch', 1.0), ('Norwegian', 1.0)] using utf_8
2022-06-24 20:59:23,693 | DEBUG | Encoding detection: utf_8 is most likely the one.
{
    "path": "R:\\08-05-22(BOM-SIG).srt",
    "encoding": "utf_8",
    "encoding_aliases": [
        "u8",
        "utf",
        "utf8",
        "utf8_ucs2",
        "utf8_ucs4",
        "cp65001"
    ],
    "alternative_encodings": [],
    "language": "English",
    "alphabets": [
        "Basic Latin",
        "Control character"
    ],
    "has_sig_or_bom": true,
    "chaos": 0.0,
    "coherence": 100.0,
    "unicode_path": null,
    "is_preferred": true
}

******* Output from my script *******
Input file  :
 R:\08-05-22(BOM-SIG).srt
Output file :
 R:\08-05-22(BOM-SIG).txt

Detected Character Encoding: UTF-8-SIG
Confidence of encoding     : 100.00%
Output will use input encoding
PS C:\CMDLINE\p> normalizer -v 'R:\08-05-22.srt'
2022-06-24 20:59:50,665 | Level 5 | ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
2022-06-24 20:59:50,666 | Level 5 | ascii should target any language(s) of ['Latin Based']
2022-06-24 20:59:50,666 | DEBUG | Encoding detection: ascii is most likely the one.
{
    "path": "R:\\08-05-22.srt",
    "encoding": "ascii",
    "encoding_aliases": [
        "646",
        "ansi_x3.4_1968",
        "ansi_x3_4_1968",
        "ansi_x3.4_1986",
        "cp367",
        "csascii",
        "ibm367",
        "iso646_us",
        "iso_646.irv_1991",
        "iso_ir_6",
        "us",
        "us_ascii"
    ],
    "alternative_encodings": [],
    "language": "English",
    "alphabets": [
        "Basic Latin",
        "Control character"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.0,
    "coherence": 0.0,
    "unicode_path": null,
    "is_preferred": true
}

******* Output from my script *******
Input file  :
 R:\08-05-22.srt
Output file :
 R:\08-05-22.txt

Detected Character Encoding: ascii
Confidence of encoding     : 100.00%
Output will use input encoding

Expected encoding
08-05-22.srt should detect as UTF-8

Desktop (please complete the following information):

  • OS: Win 10 x64
  • Python version: 3.10
  • Package version: charset-normalizer 2.0.12

Additional context
I have switched over from cChardet as it unfortunately looks to be no longer supported/updated. With cChardet it would return UTF-8 as the detected encoding for UTF-8 files. With your code it returns 'ascii' as the detected encoding (this is the plain 08-05-22.srt in my zip). With a BOM/SIG-encoded file it detects it as UTF-8 and calls it 'UTF-8-SIG', as cChardet did. Is 'ascii' the expected response for plain UTF-8 files? The reason I use it is to set the encoding on a file read using:

 with open(ifile, 'r', encoding=encoding) as original, open(ofile, 'w', encoding=encset) as new:

Where 'encoding' is the detected encoding and 'encset' is a flag that can force utf-8 or the original encoding as desired. At present I am calling your module through your cChardet legacy 'encoding' support, i.e.: print('Detected Character Encoding:', encoding)
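A minimal, self-contained sketch of that workflow; the reencode helper is mine, assuming detect as the drop-in replacement for the cChardet legacy API:

from charset_normalizer import detect

def reencode(ifile: str, ofile: str, target: str = 'utf-8') -> None:
    with open(ifile, 'rb') as fp:
        guess = detect(fp.read())
    source_encoding = guess['encoding'] or 'utf-8'  # fall back if detection fails
    with open(ifile, 'r', encoding=source_encoding) as original, \
         open(ofile, 'w', encoding=target) as new:
        new.write(original.read())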

Python 2 not yet supported

Traceback:
test/test_on_file.py:5: in <module>
    from charset_normalizer import CharsetNormalizerMatches as CnM
charset_normalizer/__init__.py:2: in <module>
    from charset_normalizer.normalizer import CharsetNormalizerMatches, CharsetNormalizerMatch
charset_normalizer/normalizer.py:3: in <module>
    import statistics
E   ImportError: No module named statistics

[Comment] Alpine, AOSC, and Spack packages

Hey @Ousret,

In order to be able to ship HTTPie 2.6.0, I needed to provide charset-normalizer packages for Alpine Linux, AOSC, and Spack.
This is just a quick note to let you know, nothing else. On other platforms, the package was already available though :)

FTR, there are pull requests:

Thank you for the easy installation requirements (and the whole module itself, indeed) 🍾

[BUG] Support for custom Python environments that ignore PEP 3120

Describe the bug
With the requests library using charset-normalizer, I get an error when calling Python via a User-Defined Transform in SAP BODS:

File "EXPRESSION", line 6, in <module>
File "c:\program files\python39\lib\site-packages\requests\__init__.py", line 48, in <module>
from charset_normalizer import __version__ as charset_normalizer_version
File "c:\program files\python39\lib\site-packages\charset_normalizer\__init__.py", line 11
SyntaxError: Non-ASCII character '\xd1' in file c:\program files\python39\lib\site-packages\charset_normalizer\__init__.py on
line 12, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details.

I am not able to define a source code encoding by placing a magic comment into the source files (either as the first or second line of the file) because the app probably modifies the script by itself (placing # -*- coding: utf-8 -*- doesn't help). Setting the environment variable PYTHONUTF8=1 doesn't help either.
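
For reference, a PEP 263 declaration is just a magic comment placed on the first or second line of the source file, along these lines (a generic illustration only; as noted above it did not help in this environment):

# -*- coding: utf-8 -*-
# Interpreters honouring PEP 263 decode this source file as UTF-8.
greeting = "dΓ©tection"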

To Reproduce
I am not able to provide code to reproduce the issue, it arises when calling Python via User-Defined Transform in SAP BODS
Please check: apache/superset#15631
This could be the same problem: https://stackoverflow.com/questions/68594538/syntaxerror-non-ascii-character-xd1-in-file-charset-normalizer-init-py-i

Expected behavior
No error. With the requests version using the chardet library there is no problem. Maybe avoiding non-ASCII characters in __init__.py could help...?

Logs
Please see the bug description.

Desktop (please complete the following information):

  • OS: Windows 2016 Server
  • Python version 3.9.6
  • Package version 2.0.6
  • Requests version 2.26.0

Additional context
N/A

Issues with encodings used in the Baltics

Describe the bug

  • ISO-8859-15 detected correctly by chardet and cchardet, charset_normalizer says cp850,cp857,cp858
  • WINDOWS-1252 detected correctly by chardet and cchardet, charset_normalizer says cp775,cp850,cp857,cp858

The language detection works in both cases (Estonian).

To Reproduce

Desktop (please complete the following information):

https://charsetnormalizerweb-ousret.vercel.app/

[BUG] prettytable 2.x support

The latest release pins prettytable <2. This makes it incompatible with the prettytable distributed by openSUSE, and likely by other vendors as well. There were not many changes in prettytable 2.x; it should be possible to support both 1.x and 2.x.

The question of algorithm improvement

After fixing some bottlenecks (#183), I selected from the performance test results table those files in the dataset on which the program showed a runtime > 0.1 s.
performance_comparison_master.xlsx

From these files I made a separate dataset
char-dataset_>0.1s.zip

and ran tests on it.


test file
test_0.1s.py

from glob import glob
from os.path import isdir
from charset_normalizer import detect

def performance_compare(size_coeff):
    if not isdir("./char-dataset_>0.1s"):
        print("This script requires char-dataset_>0.1s to be present in the package root directory")
        exit(1)
    # recursive=True lets the "**" pattern descend into the per-encoding sub-directories
    for tbt_path in sorted(glob("./char-dataset_>0.1s/**/*.*", recursive=True)):
        with open(tbt_path, "rb") as fp:
            content = fp.read() * size_coeff
        detect(content)

if __name__ == "__main__":
    performance_compare(1)

1. pprofile

pprofile --format callgrind --out cachegrind.out.0.1s.test test_0.1s.py

cachegrind.out.0.1s.zip

2. vprof heatmap

vprof -c h test_0.1s.py

vprof (5_3_2022 10_48_28 AM).zip

[PLACEHOLDER] Possible unidentified different behaviour in specific env

placeholder

Although it's solved, just wanted to mention that this is indeed a crucial mechanic, since charset_normalizer is still "young". For example, it doesn't work properly on Debian in some cases, while it is quite consistent on Windows.

Originally posted by @a-maliarov in psf/requests#5871 (comment)

I ran three VMs with different environments, based on the different CI workflows, without noticing anything different.
Maybe I missed something.

For now, this is unconfirmed.

@a-maliarov Could you provide us with more details?

[BUG] `UnicodeDecodeError: 'ascii' codec can't decode` when using `detect`

Describe the bug

I think this is likely to be very close to #136, but where I'm using detect instead of safe_open.

To Reproduce

Here's the code:

from charset_normalizer import detect

import sys

fname = sys.argv[1]
encoding = detect(open(fname, "rb").read())["encoding"]

print(encoding)

and I have one "master" input (which is the translation unit of a large C++ program):

(venv) atg@vapvdatg01:/tmp> file=original.in && du -hs $file && wc $file && file $file
9.9M    original.in
  215429   907184 10372250 original.in
original.in: ASCII text, with very long lines (2253)

but I can use https://github.com/jleffler/scc-snapshots to make two further files:

(venv) atg@vapvdatg01:/tmp> file=comments_only.in && du -hs $file && wc $file && file $file
5.4M    comments_only.in
  93821  509098 5572212 comments_only.in
comments_only.in: ASCII text

and

(venv) atg@vapvdatg01:/tmp> file=no_comments.in && du -hs $file && wc $file && file $file
4.7M    no_comments.in
 188307  398128 4891407 no_comments.in
no_comments.in: ASCII text, with very long lines (2253)

If I try to use detect on either of comments_only.in or no_comments.in, then:

(venv) atg@vapvdatg01:/tmp> python3 test.py comments_only.in
utf-8
(venv) atg@vapvdatg01:/tmp> python3 test.py no_comments.in
ascii

However:

(venv) atg@vapvdatg01:/tmp> python3 test.py original.in
Traceback (most recent call last):
  File "test.py", line 6, in <module>
    encoding = detect(open(fname, "rb").read())["encoding"]
  File "/tmp/venv/lib64/python3.8/site-packages/charset_normalizer/legacy.py", line 28, in detect
    r = from_bytes(byte_str).best()
  File "/tmp/venv/lib64/python3.8/site-packages/charset_normalizer/api.py", line 452, in from_bytes
    and fallback_u8.fingerprint != fallback_ascii.fingerprint
  File "/tmp/venv/lib64/python3.8/site-packages/charset_normalizer/models.py", line 274, in fingerprint
    return sha256(self.output()).hexdigest()
  File "/tmp/venv/lib64/python3.8/site-packages/charset_normalizer/models.py", line 265, in output
    self._output_payload = str(self).encode(encoding, "replace")
  File "/tmp/venv/lib64/python3.8/site-packages/charset_normalizer/models.py", line 114, in __str__
    self._string = str(self._payload, self._encoding, "strict")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 3026167: ordinal not in range(128)

and:

(venv) atg@vapvdatg01:/tmp> cat no_comments.in comments_only.in > merge.in && python3 test.py merge.in
Traceback (most recent call last):
  File "test.py", line 6, in <module>
    encoding = detect(open(fname, "rb").read())["encoding"]
  File "/tmp/venv/lib64/python3.8/site-packages/charset_normalizer/legacy.py", line 28, in detect
    r = from_bytes(byte_str).best()
  File "/tmp/venv/lib64/python3.8/site-packages/charset_normalizer/api.py", line 452, in from_bytes
    and fallback_u8.fingerprint != fallback_ascii.fingerprint
  File "/tmp/venv/lib64/python3.8/site-packages/charset_normalizer/models.py", line 274, in fingerprint
    return sha256(self.output()).hexdigest()
  File "/tmp/venv/lib64/python3.8/site-packages/charset_normalizer/models.py", line 265, in output
    self._output_payload = str(self).encode(encoding, "replace")
  File "/tmp/venv/lib64/python3.8/site-packages/charset_normalizer/models.py", line 114, in __str__
    self._string = str(self._payload, self._encoding, "strict")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 6450930: ordinal not in range(128)

Expected behavior

Not to crash

Logs

This is the end of the traceback:

  File "test.py", line 6, in <module>
    encoding = detect(open(fname, "rb").read())["encoding"]
  File "/tmp/venv/lib64/python3.8/site-packages/charset_normalizer/legacy.py", line 28, in detect
    r = from_bytes(byte_str).best()
  File "/tmp/venv/lib64/python3.8/site-packages/charset_normalizer/api.py", line 452, in from_bytes
    and fallback_u8.fingerprint != fallback_ascii.fingerprint
  File "/tmp/venv/lib64/python3.8/site-packages/charset_normalizer/models.py", line 274, in fingerprint
    return sha256(self.output()).hexdigest()
  File "/tmp/venv/lib64/python3.8/site-packages/charset_normalizer/models.py", line 265, in output
    self._output_payload = str(self).encode(encoding, "replace")
  File "/tmp/venv/lib64/python3.8/site-packages/charset_normalizer/models.py", line 114, in __str__
    self._string = str(self._payload, self._encoding, "strict")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 6450930: ordinal not in range(128)

Desktop (please complete the following information):

  • OS: Linux
  • Python version: 3.8.12
  • Package version: 2.0.10.dev0

Additional context
Add any other context about the problem here.

[BUG] 2.0.5 breaks with python 3.5.2 Linux

Describe the bug

Charset normalizer 2.0.5 breaks with Python 3.5.2 on ubuntu:trusty. We have some Jenkins jobs that have been running successfully for months, and then on 9/14 everything broke.

To Reproduce

This is an enterprise application so I can't give you source, but we run

pip3 install hvac --user --disable-pip-version-check
python3 asoc.py

I believe you should be able to recreate with a sample app using hvac and the correct versions.

Expected behavior

python3 asoc.py
- Traceback (most recent call last):
-  File "/home/jenkins/.local/lib/python3.5/site-packages/charset_normalizer/api.py", line 5, in <module>
-   from os import PathLike
- ImportError: cannot import name 'PathLike'

Logs

+ python3 asoc.py
Traceback (most recent call last):
  File "/home/jenkins/.local/lib/python3.5/site-packages/charset_normalizer/api.py", line 5, in <module>
    from os import PathLike
ImportError: cannot import name 'PathLike'

Desktop (please complete the following information):

  • OS: [Linux]
  • Python version [3.5.2]
  • Package version [2.0.5]

[BUG] CLI: local variable 'x_' referenced before assignment

Describe the bug
UnboundLocalError: local variable 'x_' referenced before assignment

To Reproduce
normalizer WISHGEN.TXT
WISHGEN.TXT

Expected behavior
The program does not crash.

Logs

Unable to identify originating encoding for "WISHGEN.TXT". Maybe try increasing maximum amount of chaos.
Traceback (most recent call last):
  File "c:\python39\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\python39\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Python39\Scripts\normalizer.exe\__main__.py", line 7, in <module>
  File "c:\python39\lib\site-packages\charset_normalizer\cli\normalizer.py", line 192, in cli_detect
    ] if args.alternatives else x_[0].__dict__,
UnboundLocalError: local variable 'x_' referenced before assignment

Desktop (please complete the following information):

  • Windows 10 x64
  • Python 3.9.6 x64
  • Package version: 2.0.3

Comparison with other products

[Proposal] Increase language coverage

Is your feature request related to a problem? Please describe.
Not a problem, more of an enhancement.

Describe the solution you'd like
Add other languages from other repos, assuming that they use the Unicode codepoint + n-grams model.
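
As a rough, generic illustration of what a character n-gram language profile could look like (this is not charset_normalizer's actual implementation; the function names and frequencies are made up):

from collections import Counter

def trigrams(text):
    # Collect overlapping 3-character sequences from the lower-cased text.
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def language_score(text, reference):
    # Share of the text's trigram occurrences that also appear in the reference profile.
    seen = trigrams(text)
    total = sum(seen.values()) or 1
    return sum(count for gram, count in seen.items() if gram in reference) / total

# Hypothetical profiles that would be built offline from corpora of each language.
ITALIAN_PROFILE = Counter({"che": 50, "ion": 30, "ent": 25})
ENGLISH_PROFILE = Counter({"the": 60, "ing": 40, "ent": 25})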

Describe alternatives you've considered


[Proposal] Revise the logger instantiation/initial handlers

Is your feature request related to a problem? Please describe.
The Logger is initialized in charset_normalizer/api.py in a way that does not allow developers using the library to change the logging level via the root logger. The Python logging library now makes a NullHandler available to allow library and package developers to manage logging in a way that is flexible for application developers. The [Python 3 documentation](https://docs.python.org/3/howto/logging.html#configuring-logging-for-a-library) has this snippet:

It is strongly advised that you do not add any handlers other than NullHandler to your library’s loggers. This is because the configuration of handlers is the prerogative of the application developer who uses your library. The application developer knows their target audience and what handlers are most appropriate for their application: if you add handlers β€˜under the hood’, you might well interfere with their ability to carry out unit tests and deliver logs which suit their requirements.

Describe the solution you'd like
I would like to change api.py to set up a NullHandler logger and add a function to allow application developers to set up a StreamHandler. The function would be added to the __init__.py file for charset_normalizer. The existing format would be provided as the default for the StreamHandler. The boto3 library has a nice example of this; I am including the function below.

import logging

def set_stream_logger(
    name="charset_normalizer",
    level=logging.INFO,
    format_string="%(asctime)s | %(levelname)s | %(message)s",
):
    # Attach a StreamHandler using the library's existing log format.
    logger = logging.getLogger(name)
    logger.setLevel(level)
    handler = logging.StreamHandler()
    handler.setLevel(level)
    formatter = logging.Formatter(format_string)
    handler.setFormatter(formatter)
    logger.addHandler(handler)
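
A usage sketch, assuming the proposal were adopted and set_stream_logger were re-exported from charset_normalizer/__init__.py (it is not part of the current API):

import logging
import charset_normalizer

# Route the library's logs to stderr at DEBUG level for this application run.
charset_normalizer.set_stream_logger(level=logging.DEBUG)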

Additional context
I am happy to write the PR for this. Just wanted to make sure the developers have not considered this and dismissed the change for a reason unbeknownst to me. Also, please let me know if this is not a good fit for the library. I won't be offended! I promise :)

On debian testing, python 3.8.3rc1, version parsing fails.

I get the following trace on any use of the library (using latest version from pypi)


tabularfile/csv.py:44: in __init__
    self.encoding = charset_normalizer.detect(header).get('encoding')
/tmp/tox-bdauvergne/tabularfile/py3-charsetnormalizer/lib/python3.8/site-packages/charset_normalizer/legacy.py:20: in detect
    r = CnM.from_bytes(byte_str).best().first()
/tmp/tox-bdauvergne/tabularfile/py3-charsetnormalizer/lib/python3.8/site-packages/charset_normalizer/normalizer.py:395: in from_bytes
    py_v = [int(el) for el in python_version_tuple()]
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

.0 = <tuple_iterator object at 0x7f51145cc8e0>

>   py_v = [int(el) for el in python_version_tuple()]
E   ValueError: invalid literal for int() with base 10: '3rc1'

Maybe you could simply use sys.version_info?
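
A minimal sketch of the suggested alternative (the version floor below is illustrative, not the project's actual requirement):

import sys

# sys.version_info already yields integers, so pre-release suffixes like "3rc1"
# never need to be parsed out of a string.
if sys.version_info < (3, 5):
    raise RuntimeError("Python 3.5+ is required")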

Continuous fuzzing by way of OSS-Fuzz

Hi,

I was wondering if you would like to integrate continuous fuzzing by way of OSS-Fuzz? Fuzzing is a way to automate test-case generation and can be used to find unexpected exceptions in Python. In this PR google/oss-fuzz#8265 I did an initial integration into OSS-Fuzz and the current fuzzer simply targets from_bytes. The fuzzing engine used by OSS-Fuzz is Atheris.
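
For illustration, an Atheris harness targeting from_bytes typically looks like the sketch below (the actual OSS-Fuzz harness lives in the google/oss-fuzz pull request; this one is only indicative):

import sys

import atheris

with atheris.instrument_imports():
    from charset_normalizer import from_bytes

def TestOneInput(data):
    # Any unexpected exception raised here is reported as a finding by the fuzzer.
    from_bytes(data)

if __name__ == "__main__":
    atheris.Setup(sys.argv, TestOneInput)
    atheris.Fuzz()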

If you would like to integrate, the only thing I need is a list of email(s) that will get access to the data produced by OSS-Fuzz, such as bug reports, coverage reports and more stats. Notice the emails affiliated with the project will be public in the OSS-Fuzz repo, as they will be part of a configuration file.

[Proposal] Performance improvements in loops

Is your feature request related to a problem? Please describe.
Hi, I was wondering if it would be possible to improve the performance of certain loops. For example, you do use list comprehensions, but not everywhere. Since you have a speed benchmark, you could check whether such changes help in the comparison with chardet.

Describe the solution you'd like
Here are loops where things could be improved:
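
As a generic illustration of the kind of rewrite being suggested (the function below is hypothetical, not taken from the repository):

# Accumulating loop:
def printable_flags_loop(characters):
    flags = []
    for character in characters:
        flags.append(character.isprintable())
    return flags

# Equivalent list comprehension, avoiding the repeated append lookups:
def printable_flags_comprehension(characters):
    return [character.isprintable() for character in characters]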

Additional context
I could help work on a PR if you're interested.

[BUG] Wrong encoding detected for ascii string

Describe the bug:

Looks like charset_normalizer detects the below ASCII string incorrectly as utf_16_le while chardet detects it as ascii.

To Reproduce:

>>> rawdata = b'g4UsPJdfzNkGW2jwmKDGDilKGKYtpF2X.mx3MaTWL1tL7CNn5U7DeCcodKX7S3lwwJPKNjBT8etY'

>>> import charset_normalizer
>>> detected_cn = charset_normalizer.detect(rawdata)
>>> detected_cn
{'encoding': 'utf_16_le', 'language': '', 'confidence': 1.0}

>>> import chardet
>>> detected_cd = chardet.detect(rawdata)
>>> print(detected_cd)
{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
>>>

Expected behavior:

String should be detected as ascii

Desktop (please complete the following information):

  • OS: Linux, Windows
  • Python version: 3.8.2

Python 3.11.0b1 huge regression in performance

Huge regression in performance. Initial doubts seen in #188 tests

Then, tried locally with Python 3.10.4-final

============================================================================== 126 passed in 3.43s ===============================================================================

Finally, 3.11.0b1

============================================================================== 126 passed in 20.15s ==============================================================================

Almost SIX times slower! What happened!?
I do not have time immediately to find out the cause.

[Documentation] sphinx warnings `reference target not found`

On building my packages I'm using the sphinx-build command with the -n switch, which shows warnings about missing references. These are not critical issues.
Here is the output with warnings:

+ /usr/bin/sphinx-build -n -T -b man docs build/sphinx/man
Running Sphinx v5.0.2
WARNING: Invalid configuration value found: 'language = None'. Update your configuration to a valid langauge code. Falling back to 'en' (English).
making output directory... done
WARNING: html_static_path entry '_static' does not exist
building [mo]: targets for 0 po files that are out of date
building [man]: all manpages
updating environment: [new config] 10 added, 0 changed, 0 removed
reading sources... [100%] user/support
/home/tkloczko/rpmbuild/BUILD/charset_normalizer-2.1.0/charset_normalizer/models.py:docstring of charset_normalizer.models.CharsetMatches:1: WARNING: duplicate object description of charset_normalizer.CharsetMatches, other instance in api, use :noindex: for one of them
/home/tkloczko/rpmbuild/BUILD/charset_normalizer-2.1.0/charset_normalizer/models.py:docstring of charset_normalizer.models.CharsetMatches:1: WARNING: duplicate object description of charset_normalizer.models.CharsetMatches, other instance in api, use :noindex: for one of them
/home/tkloczko/rpmbuild/BUILD/charset_normalizer-2.1.0/charset_normalizer/models.py:docstring of charset_normalizer.models.CharsetMatches.append:1: WARNING: duplicate object description of charset_normalizer.CharsetMatches.append, other instance in api, use :noindex: for one of them
/home/tkloczko/rpmbuild/BUILD/charset_normalizer-2.1.0/charset_normalizer/models.py:docstring of charset_normalizer.models.CharsetMatches.best:1: WARNING: duplicate object description of charset_normalizer.CharsetMatches.best, other instance in api, use :noindex: for one of them
/home/tkloczko/rpmbuild/BUILD/charset_normalizer-2.1.0/charset_normalizer/models.py:docstring of charset_normalizer.models.CharsetMatches.first:1: WARNING: duplicate object description of charset_normalizer.CharsetMatches.first, other instance in api, use :noindex: for one of them
/home/tkloczko/rpmbuild/BUILD/charset_normalizer-2.1.0/charset_normalizer/models.py:docstring of charset_normalizer.models.CharsetMatch:1: WARNING: duplicate object description of charset_normalizer.CharsetMatch, other instance in api, use :noindex: for one of them
/home/tkloczko/rpmbuild/BUILD/charset_normalizer-2.1.0/charset_normalizer/models.py:docstring of charset_normalizer.models.CharsetMatch:1: WARNING: duplicate object description of charset_normalizer.models.CharsetMatch, other instance in api, use :noindex: for one of them
/home/tkloczko/rpmbuild/BUILD/charset_normalizer-2.1.0/charset_normalizer/models.py:docstring of charset_normalizer.models.CharsetMatch.best:1: WARNING: duplicate object description of charset_normalizer.CharsetMatch.best, other instance in api, use :noindex: for one of them
/home/tkloczko/rpmbuild/BUILD/charset_normalizer-2.1.0/charset_normalizer/models.py:docstring of charset_normalizer.CharsetMatch.chaos_secondary_pass:1: WARNING: duplicate object description of charset_normalizer.CharsetMatch.chaos_secondary_pass, other instance in api, use :noindex: for one of them
/home/tkloczko/rpmbuild/BUILD/charset_normalizer-2.1.0/charset_normalizer/models.py:docstring of charset_normalizer.CharsetMatch.coherence_non_latin:1: WARNING: duplicate object description of charset_normalizer.CharsetMatch.coherence_non_latin, other instance in api, use :noindex: for one of them
/home/tkloczko/rpmbuild/BUILD/charset_normalizer-2.1.0/charset_normalizer/models.py:docstring of charset_normalizer.CharsetMatch.could_be_from_charset:1: WARNING: duplicate object description of charset_normalizer.CharsetMatch.could_be_from_charset, other instance in api, use :noindex: for one of them
/home/tkloczko/rpmbuild/BUILD/charset_normalizer-2.1.0/charset_normalizer/models.py:docstring of charset_normalizer.CharsetMatch.encoding_aliases:1: WARNING: duplicate object description of charset_normalizer.CharsetMatch.encoding_aliases, other instance in api, use :noindex: for one of them
/home/tkloczko/rpmbuild/BUILD/charset_normalizer-2.1.0/charset_normalizer/models.py:docstring of charset_normalizer.CharsetMatch.fingerprint:1: WARNING: duplicate object description of charset_normalizer.CharsetMatch.fingerprint, other instance in api, use :noindex: for one of them
/home/tkloczko/rpmbuild/BUILD/charset_normalizer-2.1.0/charset_normalizer/models.py:docstring of charset_normalizer.models.CharsetMatch.first:1: WARNING: duplicate object description of charset_normalizer.CharsetMatch.first, other instance in api, use :noindex: for one of them
/home/tkloczko/rpmbuild/BUILD/charset_normalizer-2.1.0/charset_normalizer/models.py:docstring of charset_normalizer.CharsetMatch.language:1: WARNING: duplicate object description of charset_normalizer.CharsetMatch.language, other instance in api, use :noindex: for one of them
/home/tkloczko/rpmbuild/BUILD/charset_normalizer-2.1.0/charset_normalizer/models.py:docstring of charset_normalizer.CharsetMatch.languages:1: WARNING: duplicate object description of charset_normalizer.CharsetMatch.languages, other instance in api, use :noindex: for one of them
/home/tkloczko/rpmbuild/BUILD/charset_normalizer-2.1.0/charset_normalizer/models.py:docstring of charset_normalizer.models.CharsetMatch.output:1: WARNING: duplicate object description of charset_normalizer.CharsetMatch.output, other instance in api, use :noindex: for one of them
/home/tkloczko/rpmbuild/BUILD/charset_normalizer-2.1.0/charset_normalizer/models.py:docstring of charset_normalizer.CharsetMatch.raw:1: WARNING: duplicate object description of charset_normalizer.CharsetMatch.raw, other instance in api, use :noindex: for one of them
/home/tkloczko/rpmbuild/BUILD/charset_normalizer-2.1.0/charset_normalizer/models.py:docstring of charset_normalizer.CharsetMatch.w_counter:1: WARNING: duplicate object description of charset_normalizer.CharsetMatch.w_counter, other instance in api, use :noindex: for one of them
looking for now-outdated files... none found
pickling environment... done
checking consistency... done
writing... python-charset-normalizer.3 { user/support user/getstarted user/advanced_search user/handling_result user/miscellaneous user/cli community/faq community/why_migrate api } /home/tkloczko/rpmbuild/BUILD/charset_normalizer-2.1.0/charset_normalizer/models.py:docstring of charset_normalizer.CharsetMatch.w_counter:: WARNING: py:class reference target not found: collections.Counter
/home/tkloczko/rpmbuild/BUILD/charset_normalizer-2.1.0/charset_normalizer/api.py:docstring of charset_normalizer.api.from_fp:: WARNING: py:obj reference target not found: typing.BinaryIO
/home/tkloczko/rpmbuild/BUILD/charset_normalizer-2.1.0/charset_normalizer/api.py:docstring of charset_normalizer.api.from_path:: WARNING: py:class reference target not found: os.PathLike
/home/tkloczko/rpmbuild/BUILD/charset_normalizer-2.1.0/charset_normalizer/api.py:docstring of charset_normalizer.api.normalize:: WARNING: py:class reference target not found: os.PathLike
/home/tkloczko/rpmbuild/BUILD/charset_normalizer-2.1.0/charset_normalizer/models.py:docstring of charset_normalizer.CharsetMatch.w_counter:: WARNING: py:class reference target not found: collections.Counter
done
build succeeded, 26 warnings.

You can peek at fixes for this kind of issue in other projects:
latchset/jwcrypto#289
click-contrib/sphinx-click@abc31069
RDFLib/rdflib-sqlalchemy#95
sissaschool/elementpath@bf869d9e
jaraco/cssutils#21
pywbem/pywbem#2895

[BUG] division by zero

Describe the bug
detect() fails with ZeroDivisionError

To Reproduce

from charset_normalizer import detect
detect(b'\xfe\xff')

# Traceback (most recent call last):
# ...
# File "/testvirtualenv/lib/python3.7/site-packages/charset_normalizer/probe_chaos.py", line 253, in ratio
# r_ = self.total_upper_accent_encountered if self.total_unaccented_letter_encountered / self.total_letter_encountered < 0.5 else 0
# ZeroDivisionError: division by zero

Expected behavior

from charset_normalizer import detect
assert detect(b'\xfe\xff') == {'encoding': None, 'language': '', 'confidence': None} 

Desktop (please complete the following information):

  • OS: Debian stretch 9.11, ArchLinux
  • Python version 3.7

[Proposal] Add module creation with mypyc to speed up

Hello.
I ran some tests to find bottlenecks and speed up the package.
The easiest option, since you are already using mypy, is to compile the module during installation using mypyc.
In this case the acceleration is about 2 times.
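
A minimal sketch of what such a build hook could look like with setuptools and mypyc (the module list is illustrative, not the project's actual build configuration):

# setup.py (sketch)
from setuptools import setup
from mypyc.build import mypycify

setup(
    name="charset_normalizer",
    ext_modules=mypycify(
        [
            "charset_normalizer/md.py",     # mess-detection hot path
            "charset_normalizer/utils.py",
        ]
    ),
)
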
Here are the results of the tests using your bin/performance.py file:

------------------------------
--> Charset-Normalizer Conclusions
   --> Avg: 0.03485252343844548s
   --> 99th: 0.2629306570015615s
   --> 95th: 0.14874039799906313s
   --> 50th: 0.02182378301222343s
------------------------------
--> Charset-Normalizer_m Conclusions (Charset-Normalizer, compiled with mypyc )
   --> Avg: 0.01605459922575392s
   --> 99th: 0.12211546800972428s
   --> 95th: 0.06977643301070202s
   --> 50th: 0.009204783011227846s
------------------------------
--> Chardet Conclusions
   --> Avg: 0.12291852888552735s
   --> 99th: 0.6617688919941429s
   --> 95th: 0.17344348499318585s
   --> 50th: 0.023028297000564635s
------------------------------
--> Cchardet Conclusions
   --> Avg: 0.003174804929368931s
   --> 99th: 0.04868195200106129s
   --> 95th: 0.008641656007966958s
   --> 50th: 0.0005420649977168068s

test_log.txt
I think the speed-up would be greater if all functions were annotated.

Underscore or Dash when pip installing?

Is your feature request related to a problem? Please describe.
A red flag went off in my head when I saw the inconsistent package spelling (- vs _):

  • PyPi page (screenshot)
  • ReadTheDocs (screenshot)
  • ReadMe.md (screenshot)

Describe the solution you'd like
Consistency. Either pip install charset-normalizer or pip install charset_normalizer. I have no preference.

Additional context
There are a lot of nasty malware packages out there... (src). Inconsistency, and names that are only slightly different, are a bad sign πŸ”₯

[BUG] `UnicodeDecodeError: 'ascii' codec can't decode byte` when using `from_path`

Describe the bug

I have a file such as this:

(venv) avj@vistrrdslin0001 ~/clones/cvise_runners/avj/charset_normalizer$ file temp.txt
temp.txt: C source, ASCII text
(venv) avj@vistrrdslin0001 ~/clones/cvise_runners/avj/charset_normalizer$ du -hs temp.txt
9.6M    temp.txt
(venv) avj@vistrrdslin0001 ~/clones/cvise_runners/avj/charset_normalizer$ wc temp.txt
  188585  1001674 10000082 temp.txt

and I'm trying to parse it with:

#!/usr/bin/env python3

from charset_normalizer import from_path

file = "temp.txt"

lines = [
    line.strip() for line in str(from_path(file).best()).split("\n")
]

using this version of charset_normalizer:

(venv) avj@vistrrdslin0001 ~/clones/cvise_runners/avj/charset_normalizer$ echo -e "import charset_normalizer\nprint(charset_normalizer.version.VERSION)" | python
['2', '0', '7']

On the main file, I get this exception:

(venv) avj@vistrrdslin0001 ~/clones/cvise_runners/avj/charset_normalizer$ ./test.py
Traceback (most recent call last):
  File "./test.py", line 8, in <module>
    line.strip() for line in str(from_path(file).best()).split("\n")
  File "/home/avj/clones/compile_commands_processor/venv/lib64/python3.8/site-packages/charset_normalizer/models.py", line 114, in __str__
    self._string = str(self._payload, self._encoding, "strict")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb5 in position 813828: ordinal not in range(128)

However, it seems that something "weird" goes on at around the 10000082 character mark:

This crashes (file size: 10000082 chars):

(venv) avj@vistrrdslin0001 ~/clones/cvise_runners/avj/charset_normalizer$ head -n 188602 good.txt | tail -n 188585 > temp.txt && wc -c temp.txt && /usr/bin/time -vvv 
timeout -k 5 -s 9 5 ./test.py
10000082 temp.txt
Traceback (most recent call last):
  File "./test.py", line 8, in <module>
    line.strip() for line in str(from_path(file).best()).split("\n")
  File "/home/avj/clones/compile_commands_processor/venv/lib64/python3.8/site-packages/charset_normalizer/models.py", line 114, in __str__
    self._string = str(self._payload, self._encoding, "strict")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb5 in position 813522: ordinal not in range(128)
Command exited with non-zero status 1
        Command being timed: "timeout -k 5 -s 9 5 ./test.py"
        User time (seconds): 0.04
        System time (seconds): 0.03
        Percent of CPU this job got: 100%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.07
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 45904
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 5116
        Voluntary context switches: 3
        Involuntary context switches: 3
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 1

whereas this does not finish within 5 seconds (maybe that's reasonable for a ~10 MiB file) (file size: 9999820 chars):

(venv) avj@vistrrdslin0001 ~/clones/cvise_runners/avj/charset_normalizer$ head -n 188602 good.txt | tail -n 188584 > temp.txt && wc -c temp.txt && /usr/bin/time -vvv 
timeout -k 5 -s 9 5 ./test.py
9999820 temp.txt
Command terminated by signal 9
        Command being timed: "timeout -k 5 -s 9 5 ./test.py"
        User time (seconds): 0.00
        System time (seconds): 0.00
        Percent of CPU this job got: 0%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:05.00
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 10304
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 806
        Voluntary context switches: 2
        Involuntary context switches: 1
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

Now, it would be reasonable to say "okay, but what happens in the one line you've removed?", so we take slightly more head and leave tail alone (file size: 9999847 chars):

(venv) avj@vistrrdslin0001 ~/clones/cvise_runners/avj/charset_normalizer$ head -n 188603 good.txt | tail -n 188585 > temp.txt && wc -c temp.txt && /usr/bin/time -vvv 
timeout -k 5 -s 9 5 ./test.py
9999847 temp.txt
Command terminated by signal 9
        Command being timed: "timeout -k 5 -s 9 5 ./test.py"
        User time (seconds): 0.00
        System time (seconds): 0.00
        Percent of CPU this job got: 0%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:05.00
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 10148
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 772
        Voluntary context switches: 2
        Involuntary context switches: 0
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

To Reproduce

Unfortunately, I am not able to immediately share this file -- I tried to use cvise and halfempty on it to find the smallest file, but hit a roadblock at around the 10000082 character mark.

Expected behavior

I believe that charset_normalizer shouldn't crash with UnicodeDecodeError: 'ascii' codec can't decode byte when using from_path

Desktop (please complete the following information):

  • OS: Linux
  • Python version: 3.8.12
  • Package version 2.0.7

[DETECTION] Incorrect natural language detection

Notice

I hereby announce that my raw input is not :

  • Too small content (<=32 characters) as I do know that ANY charset detector heavily depends on content
  • Encoded in a deprecated/abandoned encoding that is not even supported by my interpreter

Provide the file

2001 A Space Odyssey (1968).it.srt.txt

Verbose output

2022-07-19 11:02:12,521 | Level 5 | Code page ascii does not fit given bytes sequence at ALL. 'ascii' codec can't decode byte 0xc3 in position 851: ordinal not in range(128)
2022-07-19 11:02:12,521 | Level 5 | Code page utf_8 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2022-07-19 11:02:12,527 | Level 5 | utf_8 passed initial chaos probing. Mean measured chaos is 4.660000 %
2022-07-19 11:02:12,529 | Level 5 | We detected language [('English', 1.0), ('Dutch', 1.0), ('Italian', 0.9891), ('Spanish', 0.9762), ('Portuguese', 0.9565), ('French', 0.9545), ('German', 0.9295)] using utf_8
2022-07-19 11:02:12,529 | DEBUG | Encoding detection: utf_8 is most likely the one.
{
    "path": "/home/pianetto/storage/media/movies/2001 A Space Odyssey (1968)/2001 A Space Odyssey (1968).it.srt",
    "encoding": "utf_8",
    "encoding_aliases": [
        "u8",
        "utf",
        "utf8",
        "utf8_ucs2",
        "utf8_ucs4",
        "cp65001"
    ],
    "alternative_encodings": [],
    "language": "English",
    "alphabets": [
        "Basic Latin",
        "Control character",
        "Latin-1 Supplement"
    ],
    "has_sig_or_bom": false,
    "chaos": 4.66,
    "coherence": 100.0,
    "unicode_path": null,
    "is_preferred": true
}

Expected encoding

I use charset_normalizer mostly to detect the language of subtitle files. The results are not as expected on a bunch of files; I posted just one here, but let me know if you'd like more samples. In this particular case, I expect to get Italian as the language, but I get English instead.

Desktop

  • OS: Fedora Linux 36
  • Python version: Python 3.10.5 (but tried with 3.8.13 as well)
  • Package version: 2.1.0 (but tried with 2.0.12 as well)

Two test failures with 2.0.11

Describe the bug
test_explain_true_behavior and test_explain_false_handler_set_behavior started to fail here with 2.0.11

Logs

py.test-3.9
=========================================================================== test session starts ============================================================================
platform linux -- Python 3.9.10, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
benchmark: 3.4.1 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /var/tmp/paludis/build/dev-python-charset-normalizer-2.0.11/work/PYTHON_ABIS/3.9/charset_normalizer-2.0.11, configfile: setup.cfg
plugins: flake8-1.0.0, expect-1.1.0, flaky-3.7.0, mypy-plugins-1.6.1, subtests-0.5.0, forked-1.3.0, cov-2.8.1, requests-mock-1.9.3, benchmark-3.4.1, mock-3.6.1, hypothesis-6.21.5, xdist-2.4.0, xprocess-0.18.1, localserver-0.5.1, timeout-2.0.2, pyfakefs-4.5.4, asyncio-0.17.2, anyio-3.3.4, services-2.2.1
asyncio: mode=legacy
collected 126 items                                                                                                                                                        

tests/test_base_detection.py ..........................                                                                                                              [ 20%]
tests/test_cli.py ............                                                                                                                                       [ 30%]
tests/test_coherence_detection.py ...............                                                                                                                    [ 42%]
tests/test_detect_legacy.py ....                                                                                                                                     [ 45%]
tests/test_edge_case.py .                                                                                                                                            [ 46%]
tests/test_full_detection.py .................                                                                                                                       [ 59%]
tests/test_large_payload.py ...                                                                                                                                      [ 61%]
tests/test_logging.py FF..                                                                                                                                           [ 65%]
tests/test_mess_detection.py ..........                                                                                                                              [ 73%]
tests/test_normalize_fp.py .                                                                                                                                         [ 73%]
tests/test_preemptive_detection.py ..........                                                                                                                        [ 81%]
tests/test_utils.py Coverage.py warning: No data was collected. (no-data-collected)
.......................                                                                                                                          [100%]

================================================================================= FAILURES =================================================================================
_____________________________________________________________ TestLogBehaviorClass.test_explain_true_behavior ______________________________________________________________

self = <tests.test_logging.TestLogBehaviorClass object at 0x7ffa34383d90>, caplog = <_pytest.logging.LogCaptureFixture object at 0x7ffa34383ac0>
    def test_explain_true_behavior(self, caplog):
        test_sequence = b'This is a test sequence of bytes that should be sufficient'
        from_bytes(test_sequence, steps=1, chunk_size=50, explain=True)
        assert explain_handler not in self.logger.handlers
        for record in caplog.records:
>           assert record.levelname in ["Level 5", "DEBUG"]
E           assert 'VERBOSE' in ['Level 5', 'DEBUG']
E            +  where 'VERBOSE' = <LogRecord: charset_normalizer, 5, /var/tmp/paludis/build/dev-python-charset-normalizer-2.0.11/work/PYTHON_ABIS/3.9/ch...lizer-2.0.11/build/lib/charset_normalizer/api.py, 394, "%s passed initial chaos probing. Mean measured chaos is %f %%">.levelname

tests/test_logging.py:21: AssertionError
--------------------------------------------------------------------------- Captured stderr call ---------------------------------------------------------------------------
2022-01-30 22:12:09,604 | VERBOSE | ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
2022-01-30 22:12:09,604 | VERBOSE | ascii should target any language(s) of ['Latin Based']
2022-01-30 22:12:09,604 | DEBUG | Encoding detection: ascii is most likely the one.
---------------------------------------------------------------------------- Captured log call -----------------------------------------------------------------------------
VERBOSE  charset_normalizer:api.py:394 ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
VERBOSE  charset_normalizer:api.py:407 ascii should target any language(s) of ['Latin Based']
DEBUG    charset_normalizer:api.py:451 Encoding detection: ascii is most likely the one.
_______________________________________________________ TestLogBehaviorClass.test_explain_false_handler_set_behavior _______________________________________________________

self = <tests.test_logging.TestLogBehaviorClass object at 0x7ffa3431ad30>, caplog = <_pytest.logging.LogCaptureFixture object at 0x7ffa3431aaf0>

    def test_explain_false_handler_set_behavior(self, caplog):
        test_sequence = b'This is a test sequence of bytes that should be sufficient'
        set_logging_handler(level=TRACE, format_string="%(message)s")
        from_bytes(test_sequence, steps=1, chunk_size=50, explain=False)
        assert any(isinstance(hdl, logging.StreamHandler) for hdl in self.logger.handlers)
        for record in caplog.records:
>           assert record.levelname in ["Level 5", "DEBUG"]
E           assert 'VERBOSE' in ['Level 5', 'DEBUG']
E            +  where 'VERBOSE' = <LogRecord: charset_normalizer, 5, /var/tmp/paludis/build/dev-python-charset-normalizer-2.0.11/work/PYTHON_ABIS/3.9/ch...lizer-2.0.11/build/lib/charset_normalizer/api.py, 394, "%s passed initial chaos probing. Mean measured chaos is %f %%">.levelname

tests/test_logging.py:29: AssertionError
--------------------------------------------------------------------------- Captured stderr call ---------------------------------------------------------------------------
ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
ascii should target any language(s) of ['Latin Based']
Encoding detection: ascii is most likely the one.
---------------------------------------------------------------------------- Captured log call -----------------------------------------------------------------------------
VERBOSE  charset_normalizer:api.py:394 ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
VERBOSE  charset_normalizer:api.py:407 ascii should target any language(s) of ['Latin Based']
DEBUG    charset_normalizer:api.py:451 Encoding detection: ascii is most likely the one.
============================================================================= warnings summary =============================================================================
../../../../../../../../../usr/lib/python3.9/site-packages/pytest_asyncio/plugin.py:191
  /usr/lib/python3.9/site-packages/pytest_asyncio/plugin.py:191: DeprecationWarning: The 'asyncio_mode' default value will change to 'strict' in future, please explicitly use 'asyncio_mode=strict' or 'asyncio_mode=auto' in pytest configuration file.
    config.issue_config_time_warning(LEGACY_MODE, stacklevel=2)

tests/test_cli.py::TestCommandLineInterface::test_force_replace_without_replace
tests/test_detect_legacy.py::TestDetectLegacy::test_detect_dict_keys
  /usr/lib/python3.9/site-packages/pytest_asyncio/plugin.py:317: DeprecationWarning: '@pytest.fixture' is applied to <fixture _make_xunit_fixture.<locals>.fixture, file=/usr/lib/python3.9/site-packages/_pytest/unittest.py, line=144> in 'legacy' mode, please replace it with '@pytest_asyncio.fixture' as a preparation for switching to 'strict' mode (or use 'auto' mode to seamlessly handle all these fixtures as asyncio-driven).
    warnings.warn(

tests/test_logging.py::TestLogBehaviorClass::test_explain_true_behavior
  /usr/lib/python3.9/site-packages/pytest_asyncio/plugin.py:317: DeprecationWarning: '@pytest.fixture' is applied to <fixture caplog, file=/usr/lib/python3.9/site-packages/_pytest/logging.py, line=475> in 'legacy' mode, please replace it with '@pytest_asyncio.fixture' as a preparation for switching to 'strict' mode (or use 'auto' mode to seamlessly handle all these fixtures as asyncio-driven).
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/warnings.html

---------- coverage: platform linux, python 3.9.10-final-0 -----------
Name                                    Stmts   Miss  Cover   Missing
---------------------------------------------------------------------
charset_normalizer/__init__.py              9      9     0%   2-56
charset_normalizer/api.py                 222    222     0%   1-608
charset_normalizer/assets/__init__.py       2      2     0%   2-4
charset_normalizer/cd.py                  163    163     0%   1-340
charset_normalizer/cli/__init__.py          0      0   100%
charset_normalizer/cli/normalizer.py       91     91     0%   1-290
charset_normalizer/constant.py             23     23     0%   1-503
charset_normalizer/legacy.py               31     31     0%   1-95
charset_normalizer/md.py                  273    273     0%   1-559
charset_normalizer/models.py              195    195     0%   1-392
charset_normalizer/utils.py               187    187     0%   1-342
charset_normalizer/version.py               2      2     0%   5-6
---------------------------------------------------------------------
TOTAL                                    1198   1198     0%

================================================================ 2 failed, 124 passed, 4 warnings in 2.52s =================================================================

Desktop (please complete the following information):

  • OS: Linux
  • Python version 3.9.10
  • Package version 2.0.11

Maybe fast, but not correct: charset_normalizer fails to detect UTF-8 *with small content*

I made a quick test:

vstinner@apu$ /bin/cat bla
hΓ©llo world!

vstinner@apu$ hexdump -C bla
00000000  68 c3 a9 6c 6c 6f 20 77  6f 72 6c 64 21 0a        |h..llo world!.|
0000000e

vstinner@apu$ env/bin/python -m charset_normalizer.cli.normalizer bla
+----------+----------+----------+----------------------------------------+-------+-----------+
| Filename | Encoding | Language |               Alphabets                | Chaos | Coherence |
+----------+----------+----------+----------------------------------------+-------+-----------+
|   bla    |   big5   | Unknown  | CJK Unified Ideographs and Basic Latin | 0.0 % |   0.0 %   |
+----------+----------+----------+----------------------------------------+-------+-----------+

vstinner@apu$ file bla
bla: UTF-8 Unicode text

charset_normalizer is wrong: the encoding is UTF-8, not big5.

vstinner@apu$ python3
Python 3.7.4 (default, Jul  9 2019, 16:32:37) 
>>> data=open("bla", "rb").read()
>>> data
b'h\xc3\xa9llo world!\n'

>>> print(ascii(data.decode('utf8')))  # expected result
'h\xe9llo world!\n'
>>> print(ascii(data.decode('big5')))  # wrong
'h\u77c7llo world!\n'

[DETECTION] Issue with encodings used by Asian languages

Notice
I hereby announce that my raw input is not :

  • Too small content (<=32 characters) as I do know that ANY charset detector heavily depends on content
  • Encoded in a deprecated/abandoned encoding that is not even supported by my interpreter

Provide the file

This case:

Other problematic cases:

Verbose output
Using the CLI, run normalizer -v ./my-file.txt and paste the result in here.

2021-10-12 17:20:36,087 | WARNING | override steps (5) and chunk_size (512) as content does not fit (240 byte(s) given) parameters.
2021-10-12 17:20:36,087 | WARNING | Code page ascii does not fit given bytes sequence at ALL. 'ascii' codec can't decode byte 0xe6 in position 2: ordinal not in range(128)
2021-10-12 17:20:36,087 | WARNING | Code page utf_8 does not fit given bytes sequence at ALL. 'utf-8' codec can't decode byte 0xe6 in position 2: invalid continuation byte
2021-10-12 17:20:36,089 | WARNING | Code page big5 does not fit given bytes sequence at ALL. 'big5' codec can't decode byte 0xe6 in position 2: illegal multibyte sequence
2021-10-12 17:20:36,090 | WARNING | Code page big5hkscs does not fit given bytes sequence at ALL. 'big5hkscs' codec can't decode byte 0xe6 in position 2: illegal multibyte sequence
2021-10-12 17:20:36,095 | WARNING | cp037 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 311.800000 %.
2021-10-12 17:20:36,096 | WARNING | cp1026 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2021-10-12 17:20:36,097 | WARNING | cp1125 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 66.700000 %.
2021-10-12 17:20:36,097 | WARNING | cp1140 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2021-10-12 17:20:36,098 | WARNING | cp1250 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 12.100000 %.
2021-10-12 17:20:36,099 | WARNING | cp1251 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 72.700000 %.
2021-10-12 17:20:36,100 | WARNING | cp1252 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 41.900000 %.
2021-10-12 17:20:36,100 | WARNING | Code page cp1253 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xaa in position 25: character maps to <undefined>
2021-10-12 17:20:36,101 | WARNING | cp1254 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 41.900000 %.
2021-10-12 17:20:36,102 | WARNING | Code page cp1255 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xfc in position 60: character maps to <undefined>
2021-10-12 17:20:36,102 | WARNING | cp1256 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 30.300000 %.
2021-10-12 17:20:36,103 | WARNING | cp1257 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 12.100000 %.
2021-10-12 17:20:36,104 | WARNING | cp1258 is deemed too similar to code page cp1252 and was consider unsuited already. Continuing!
2021-10-12 17:20:36,104 | WARNING | cp273 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2021-10-12 17:20:36,105 | WARNING | Code page cp424 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x75 in position 5: character maps to <undefined>
2021-10-12 17:20:36,105 | WARNING | cp437 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 39.300000 %.
2021-10-12 17:20:36,106 | WARNING | cp500 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2021-10-12 17:20:36,107 | WARNING | cp775 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 12.200000 %.
2021-10-12 17:20:36,108 | WARNING | cp850 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2021-10-12 17:20:36,109 | WARNING | cp852 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 24.300000 %.
2021-10-12 17:20:36,110 | WARNING | cp855 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 42.400000 %.
2021-10-12 17:20:36,111 | WARNING | Code page cp857 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xd5 in position 236: character maps to <undefined>
2021-10-12 17:20:36,111 | WARNING | cp858 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2021-10-12 17:20:36,112 | WARNING | cp860 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2021-10-12 17:20:36,113 | WARNING | cp861 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2021-10-12 17:20:36,113 | WARNING | cp862 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2021-10-12 17:20:36,114 | WARNING | cp863 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2021-10-12 17:20:36,115 | WARNING | cp864 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 109.100000 %.
2021-10-12 17:20:36,115 | WARNING | cp865 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2021-10-12 17:20:36,116 | WARNING | cp866 is deemed too similar to code page cp1125 and was consider unsuited already. Continuing!
2021-10-12 17:20:36,117 | WARNING | cp869 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 42.400000 %.
2021-10-12 17:20:36,118 | WARNING | Code page cp932 does not fit given bytes sequence at ALL. 'cp932' codec can't decode byte 0xe6 in position 2: illegal multibyte sequence
2021-10-12 17:20:36,120 | WARNING | Code page cp949 does not fit given bytes sequence at ALL. 'cp949' codec can't decode byte 0xe6 in position 2: illegal multibyte sequence
2021-10-12 17:20:36,120 | WARNING | Code page cp950 does not fit given bytes sequence at ALL. 'cp950' codec can't decode byte 0xe6 in position 2: illegal multibyte sequence
2021-10-12 17:20:36,121 | WARNING | Code page euc_jis_2004 does not fit given bytes sequence at ALL. 'euc_jis_2004' codec can't decode byte 0xe6 in position 2: illegal multibyte sequence
2021-10-12 17:20:36,121 | WARNING | Code page euc_jisx0213 does not fit given bytes sequence at ALL. 'euc_jisx0213' codec can't decode byte 0xe6 in position 2: illegal multibyte sequence
2021-10-12 17:20:36,122 | WARNING | Code page euc_jp does not fit given bytes sequence at ALL. 'euc_jp' codec can't decode byte 0xe6 in position 2: illegal multibyte sequence
2021-10-12 17:20:36,122 | WARNING | Code page euc_kr does not fit given bytes sequence at ALL. 'euc_kr' codec can't decode byte 0xe6 in position 2: illegal multibyte sequence
2021-10-12 17:20:36,124 | WARNING | Code page gb18030 does not fit given bytes sequence at ALL. 'gb18030' codec can't decode byte 0xe6 in position 2: illegal multibyte sequence
2021-10-12 17:20:36,124 | WARNING | Code page gb2312 does not fit given bytes sequence at ALL. 'gb2312' codec can't decode byte 0xe6 in position 2: illegal multibyte sequence
2021-10-12 17:20:36,125 | WARNING | Code page gbk does not fit given bytes sequence at ALL. 'gbk' codec can't decode byte 0xe6 in position 2: illegal multibyte sequence
2021-10-12 17:20:36,126 | WARNING | hp_roman8 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 42.400000 %.
2021-10-12 17:20:36,126 | WARNING | Code page hz does not fit given bytes sequence at ALL. 'hz' codec can't decode byte 0xe6 in position 2: illegal multibyte sequence
2021-10-12 17:20:36,127 | WARNING | Code page iso2022_jp does not fit given bytes sequence at ALL. 'iso2022_jp' codec can't decode byte 0xe6 in position 2: illegal multibyte sequence
2021-10-12 17:20:36,128 | WARNING | Code page iso2022_jp_1 does not fit given bytes sequence at ALL. 'iso2022_jp_1' codec can't decode byte 0xe6 in position 2: illegal multibyte sequence
2021-10-12 17:20:36,128 | WARNING | Code page iso2022_jp_2 does not fit given bytes sequence at ALL. 'iso2022_jp_2' codec can't decode byte 0xe6 in position 2: illegal multibyte sequence
2021-10-12 17:20:36,129 | WARNING | Code page iso2022_jp_2004 does not fit given bytes sequence at ALL. 'iso2022_jp_2004' codec can't decode byte 0xe6 in position 2: illegal multibyte sequence
2021-10-12 17:20:36,129 | WARNING | Code page iso2022_jp_3 does not fit given bytes sequence at ALL. 'iso2022_jp_3' codec can't decode byte 0xe6 in position 2: illegal multibyte sequence
2021-10-12 17:20:36,130 | WARNING | Code page iso2022_jp_ext does not fit given bytes sequence at ALL. 'iso2022_jp_ext' codec can't decode byte 0xe6 in position 2: illegal multibyte sequence
2021-10-12 17:20:36,130 | WARNING | Code page iso2022_kr does not fit given bytes sequence at ALL. 'iso2022_kr' codec can't decode byte 0xe6 in position 2: illegal multibyte sequence
2021-10-12 17:20:36,131 | WARNING | iso8859_10 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 12.100000 %.
2021-10-12 17:20:36,131 | WARNING | Code page iso8859_11 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xfc in position 60: character maps to <undefined>
2021-10-12 17:20:36,132 | WARNING | iso8859_13 is deemed too similar to code page cp1257 and was consider unsuited already. Continuing!
2021-10-12 17:20:36,132 | WARNING | iso8859_14 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
2021-10-12 17:20:36,133 | WARNING | iso8859_15 is deemed too similar to code page cp1252 and was consider unsuited already. Continuing!
2021-10-12 17:20:36,133 | WARNING | iso8859_16 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 12.100000 %.
2021-10-12 17:20:36,134 | WARNING | iso8859_2 is deemed too similar to code page cp1250 and was consider unsuited already. Continuing!
2021-10-12 17:20:36,134 | WARNING | Code page iso8859_3 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xae in position 17: character maps to <undefined>
2021-10-12 17:20:36,135 | WARNING | iso8859_4 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
2021-10-12 17:20:36,135 | WARNING | iso8859_5 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 78.800000 %.
2021-10-12 17:20:36,136 | WARNING | Code page iso8859_6 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xaf in position 6: character maps to <undefined>
2021-10-12 17:20:36,136 | WARNING | Code page iso8859_7 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xae in position 17: character maps to <undefined>
2021-10-12 17:20:36,137 | WARNING | Code page iso8859_8 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xd1 in position 47: character maps to <undefined>
2021-10-12 17:20:36,137 | WARNING | iso8859_9 is deemed too similar to code page cp1252 and was consider unsuited already. Continuing!
2021-10-12 17:20:36,137 | WARNING | Code page johab does not fit given bytes sequence at ALL. 'johab' codec can't decode byte 0xe6 in position 2: illegal multibyte sequence
2021-10-12 17:20:36,138 | WARNING | koi8_r was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 72.700000 %.
2021-10-12 17:20:36,139 | WARNING | kz1048 is deemed too similar to code page cp1251 and was consider unsuited already. Continuing!
2021-10-12 17:20:36,139 | WARNING | latin_1 is deemed too similar to code page cp1252 and was consider unsuited already. Continuing!
2021-10-12 17:20:36,140 | WARNING | mac_cyrillic was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 66.700000 %.
2021-10-12 17:20:36,140 | WARNING | mac_greek was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 42.400000 %.
2021-10-12 17:20:36,141 | WARNING | mac_iceland was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 24.200000 %.
2021-10-12 17:20:36,142 | WARNING | mac_latin2 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 18.200000 %.
2021-10-12 17:20:36,142 | WARNING | mac_roman is deemed too similar to code page mac_iceland and was consider unsuited already. Continuing!
2021-10-12 17:20:36,143 | WARNING | mac_turkish is deemed too similar to code page mac_iceland and was consider unsuited already. Continuing!
2021-10-12 17:20:36,143 | WARNING | ptcp154 is deemed too similar to code page cp1251 and was consider unsuited already. Continuing!
2021-10-12 17:20:36,143 | WARNING | Code page shift_jis does not fit given bytes sequence at ALL. 'shift_jis' codec can't decode byte 0xe6 in position 2: illegal multibyte sequence
2021-10-12 17:20:36,144 | WARNING | Code page shift_jis_2004 does not fit given bytes sequence at ALL. 'shift_jis_2004' codec can't decode byte 0xe6 in position 2: illegal multibyte sequence
2021-10-12 17:20:36,144 | WARNING | Code page shift_jisx0213 does not fit given bytes sequence at ALL. 'shift_jisx0213' codec can't decode byte 0xe6 in position 2: illegal multibyte sequence
2021-10-12 17:20:36,145 | WARNING | Code page tis_620 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xfc in position 60: character maps to <undefined>
2021-10-12 17:20:36,145 | INFO | Encoding utf_16 wont be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.
2021-10-12 17:20:36,145 | WARNING | Code page utf_16_be does not fit given bytes sequence at ALL. 'utf-16-be' codec can't decode bytes in position 76-77: illegal UTF-16 surrogate
2021-10-12 17:20:36,146 | INFO | Code page utf_16_le is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2021-10-12 17:20:36,146 | WARNING | utf_16_le was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 178.400000 %.
2021-10-12 17:20:36,147 | INFO | Encoding utf_32 wont be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.
2021-10-12 17:20:36,147 | WARNING | Code page utf_32_be does not fit given bytes sequence at ALL. 'utf-32-be' codec can't decode bytes in position 0-3: code point not in range(0x110000)
2021-10-12 17:20:36,148 | WARNING | Code page utf_32_le does not fit given bytes sequence at ALL. 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)
2021-10-12 17:20:36,148 | WARNING | Code page utf_7 does not fit given bytes sequence at ALL. 'utf7' codec can't decode byte 0xe6 in position 2: unexpected special character
Unable to identify originating encoding for "viscii.txt". Maybe try increasing maximum amount of chaos.
{
    "path": "/home/adbar/Downloads/viscii.txt",
    "encoding": null,
    "encoding_aliases": [],
    "alternative_encodings": [],
    "language": "Unknown",
    "alphabets": [],
    "has_sig_or_bom": false,
    "chaos": 1.0,
    "coherence": 0.0,
    "unicode_path": null,
    "is_preferred": true
}
....

Expected encoding
The expected encoding, VISCII, seems to be unknown to charset_normalizer. Detection does not work properly for the other files listed above either. Language detection also fails, but that is secondary IMHO.
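
VISCII is not among the codecs shipped with CPython's standard library, which is why charset_normalizer cannot propose it. A minimal sketch, using only the standard library, to confirm that:

import codecs

try:
    codecs.lookup("viscii")
except LookupError as exc:
    # CPython provides no built-in VISCII codec, so a detector built on the
    # stdlib codecs cannot return it as a result.
    print(exc)  # unknown encoding: viscii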

Desktop (please complete the following information):

  • OS: Linux
  • Python version: 3.6.9
  • Package version: 2.0.7

Additional context

cchardet guesses it right in all the cases listed above, and chardet seems to perform slightly better than charset_normalizer.

[BUG] api.py "sequences: bytes" SyntaxError

Describe the bug
Error raised during module import:

  File \"/opt/custom-envs/cloudbuilder-k8s-py3/lib/python3.9/site-packages/requests/__init__.py\", line 48, in <module>
    from charset_normalizer import __version__ as charset_normalizer_version
  File \"/opt/custom-envs/cloudbuilder-k8s-py3/lib/python3.9/site-packages/charset_normalizer/__init__.py\", line 24, in <module>
    from .api import from_bytes, from_fp, from_path, normalize
  File \"/opt/custom-envs/cloudbuilder-k8s-py3/lib/python3.9/site-packages/charset_normalizer/api.py\", line 36
    sequences: bytes,
             ^
SyntaxError: invalid syntax
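
Note that the failing line is a function parameter annotation ("sequences: bytes"), which only parses on Python 3, so a SyntaxError at this point usually means the module is being compiled by a Python 2 interpreter, whatever other interpreters are installed. A minimal sketch (hypothetical file and function name, for illustration only) that reproduces the same class of failure:

# annotations_demo.py -- hypothetical example, not the library's actual code.
# Parameter annotations are valid from Python 3.0 onward; parsing this file
# with a Python 2 interpreter fails with "SyntaxError: invalid syntax" at the colon.
def takes_bytes(sequences: bytes) -> None:
    print(len(sequences))

if __name__ == "__main__":
    takes_bytes(b"abc")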

To Reproduce

  • install Python 3.9.9
  • use ansible collection kubernetes.core version 2.2.2
  • run a task using kubernetes.core.k8s

Logs
log.txt

Desktop (please complete the following information):

  • OS: Red Hat Enterprise Linux Server release 7.9 (Maipo)
  • Python version: Python 3.9.9
  • Package version: 2.0.9

Wrong encoding detected for URL

Describe the bug
It looks like charset_normalizer detects the wrong encoding.

To Reproduce

from urllib.request import urlopen

import chardet
import cchardet
import charset_normalizer
import magic  # filemagic: libmagic bindings, same backend as the `file` command

with magic.Magic() as m:
    # Fetch the raw CSV payload once and run every detector on the same bytes.
    rawdata = urlopen('https://api.acleddata.com/acled/read.csv?limit=0&terms=accept&iso=112').read()
    print(chardet.detect(rawdata))
    print(cchardet.detect(rawdata))
    print(charset_normalizer.detect(rawdata))
    print(m.id_buffer(rawdata))

Expected behavior
charset_normalizer detects cp932, which is a Japanese code page. However, I don't see any Japanese characters in the file. file -i, enca, chardet, and cchardet all think it is UTF-8.
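
Besides the chardet-compatible detect() helper, the same bytes can be fed to from_bytes(), whose best match exposes the fields seen in the CLI JSON output (encoding, language, chaos, coherence). A minimal sketch, reusing rawdata from the snippet above:

best_guess = charset_normalizer.from_bytes(rawdata).best()
if best_guess is not None:
    # Same fields as the JSON report produced by the normalizer CLI
    print(best_guess.encoding, best_guess.language, best_guess.chaos, best_guess.coherence)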

Logs

{'encoding': 'utf-8', 'confidence': 0.7525, 'language': ''}
{'encoding': 'UTF-8', 'confidence': 0.7524999976158142}
{'encoding': 'cp932', 'language': 'English', 'confidence': 0.9533799533799534}
UTF-8 Unicode text, with very long lines

Desktop (please complete the following information):

  • OS: Linux
  • Python version: 3.6
