chardet / chardet Goto Github PK

View Code? Open in Web Editor NEW

2.1K 2.1K 252.0 4.97 MB

Python character encoding detector

License: GNU Lesser General Public License v2.1

Python 98.77% HTML 1.23%

chardet's Introduction

Chardet: The Universal Character Encoding Detector

Detects

ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)
Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and Simplified Chinese)
EUC-JP, SHIFT_JIS, CP932, ISO-2022-JP (Japanese)
EUC-KR, ISO-2022-KR, Johab (Korean)
KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251 (Cyrillic)
ISO-8859-5, windows-1251 (Bulgarian)
ISO-8859-1, windows-1252, MacRoman (Western European languages)
ISO-8859-7, windows-1253 (Greek)
ISO-8859-8, windows-1255 (Visual and Logical Hebrew)
TIS-620 (Thai)

Note

Our ISO-8859-2 and windows-1250 (Hungarian) probers have been temporarily disabled until we can retrain the models.

Requires Python 3.7+.

Installation

Install from PyPI:

pip install chardet

Documentation

For users, docs are now available at https://chardet.readthedocs.io/.

Command-line Tool

chardet comes with a command-line script which reports on the encodings of one or more files:

% chardetect somefile someotherfile
somefile: windows-1252 with confidence 0.5
someotherfile: ascii with confidence 1.0

About

This is a continuation of Mark Pilgrim's excellent original chardet port from C, and Ian Cordasco's charade Python 3-compatible fork.

maintainer: Dan Blanchard

chardet's People

Contributors

Stargazers

Watchers

Forkers

abadger apelisse hashy krachwal bsidhom perkville jonathanherr haven-jeon konsolerr xuzhenqiang78 yiduo frnsys dwoods mgedmin drwrong oparrish keruimin queeup-forks pyyoshi hickford 2xyo pombredanne salesseek coinpayee gdyuldin inglesp alrawi jr69ss atbest zougloub likaiguo lpsinger thedrow equinoxx paweljasinski pombreda jaimiemurdock spyatakov barak-kedem acccounttest gilesbrown ddboline asdfsx godofhavoc graingert azizou ltd-beget felixonmars simudream sergey-ob deskpy hkingz lhh758 rfdiazpr andreip julienvey hestendelin aisxyz candy0427 rrthomas hitflame freedream520 baoluchuling davinirjr jackdied smashingpks pretix wangpanjun security-prince benallums myvoyage bin-zhao ygemici jn7163 adham1993 gaozhichang awesome-python mdamien gsdu8g9 ruhengchen jpz frankkkkk qfan michelpf tadhg-ohiggins guillermogsjc yegorich abuuuu codepongo e42s yy221 khan hanksantford deuteron cjcross chylonczyk lfany charygao jockerboby githuangg

chardet's Issues

detect wrong ON windows

python 2.7.11
chardet 2.3.0
cchardet-1.1.1-cp27-cp27m-win_amd64.whl (md5)
window server 2008 r2 enterprise
system language: chinese simple (操作系统显示语言：简体中文）
test.py is the python code
testfile.cs is the input file (open by notepad, [save as（另存） ] show the encode（编码） is ANSI
testfile.chardet.cs is the output file: decode by chardet.detect(raw)['encoding'] ,then , encode by utf8
testfile.cchardet.cs is the output file: decode by cchardet.detect(raw)['encoding'] ,then , encode by utf8
testfile.GB2312.cs is the output file: decode by 'GB2312' ,then , encode by utf8.

testfile.GB2312.cs is the RIGHT.
test_chardet.zip

Rename main branches to prevent commons pull request target mistake

I'm just putting this here as an announcement that I will shortly be renaming the main branches as follows:

master ➡️ stable
develop ➡️ master

This should prevent the common problem where people don't change the target to develop even though we're trying to use git-flow.

PyPi package is not up to date

Hi! I just compared your github codebase to the your PyPi package (installed via pip) and noticed that the code there seems to be out of date (even though it gives the same version number).

Would be cool to have an up-to-date version on pypi since there's a problem with ascii escape sequences in the pypi package that doesn't seem to exist with the github version :)

Adding an interface so that we can know the confidence of all encodings?

It might be useful to have some method on the UniversalDetector class that populates the confidence of every encoding for a given string. This can maybe help out figuring why we are failing on those 25 other test cases and also help if any encoding is to be added / deleted. Thoughts?

get wrong codec

I use chardet to detect the codec of the 'Cinecitt%C3%A0%20Make', like this

import urlparse
import chardet
a = 'Cinecitt%C3%A0%20Make'
b = urlparse.unquote(a)
chardet.detect(b)

the result is {'confidence': 0.814286076190637, 'encoding': 'ISO-8859-2'}
but '%c3%a0' is an utf8 code of character 'à'
is there something wrong?

Add warning about encodings not supported by Python

We currently detect EUC-TW pretty well, but it's not actually supported by Python. Most users would expect that

result = chardet.detect(some_bytes)
try:
    some_bytes.decode(result['encoding'])
except UnicodeDecodeError:
    print('Oops. chardet detected the wrong encoding')

would always work, but the decode line can actually fail with a LookupError too because of encodings that aren't supported by Python.

Provide classifiers for PyPI that specify which Python versions chardet supports

Fix failing unit tests.

Currently, the following 27 unit tests fail. We need to figure that out and fix them.

.FFF.FF..FFF........F............................................................................................................................................F.......FFFFFFF.FFFFFF.......................................F....F.FF.........................................................................................................................................................
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: 'iso-8859-2' != 'iso-8859-7'
- iso-8859-2
?          ^
+ iso-8859-7
?          ^
 : Expected iso-8859-7, but got 'ISO-8859-2' in /home/travis/build/erikrose/chardet/tests/iso-8859-7-greek/naftemporiki.gr.bus.xml
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: 'iso-8859-2' != 'iso-8859-7'
- iso-8859-2
?          ^
+ iso-8859-7
?          ^
 : Expected iso-8859-7, but got 'ISO-8859-2' in /home/travis/build/erikrose/chardet/tests/iso-8859-7-greek/naftemporiki.gr.cmm.xml
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: 'iso-8859-2' != 'iso-8859-7'
- iso-8859-2
?          ^
+ iso-8859-7
?          ^
 : Expected iso-8859-7, but got 'ISO-8859-2' in /home/travis/build/erikrose/chardet/tests/iso-8859-7-greek/naftemporiki.gr.fin.xml
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: 'iso-8859-2' != 'iso-8859-7'
- iso-8859-2
?          ^
+ iso-8859-7
?          ^
 : Expected iso-8859-7, but got 'ISO-8859-2' in /home/travis/build/erikrose/chardet/tests/iso-8859-7-greek/naftemporiki.gr.mrt.xml
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: 'windows-1253' != 'iso-8859-7'
- windows-1253
+ iso-8859-7
 : Expected iso-8859-7, but got 'windows-1253' in /home/travis/build/erikrose/chardet/tests/iso-8859-7-greek/disabled.gr.xml
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: 'iso-8859-2' != 'iso-8859-7'
- iso-8859-2
?          ^
+ iso-8859-7
?          ^
 : Expected iso-8859-7, but got 'ISO-8859-2' in /home/travis/build/erikrose/chardet/tests/iso-8859-7-greek/naftemporiki.gr.spo.xml
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: 'iso-8859-2' != 'iso-8859-7'
- iso-8859-2
?          ^
+ iso-8859-7
?          ^
 : Expected iso-8859-7, but got 'ISO-8859-2' in /home/travis/build/erikrose/chardet/tests/iso-8859-7-greek/naftemporiki.gr.mrk.xml
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: 'iso-8859-2' != 'iso-8859-7'
- iso-8859-2
?          ^
+ iso-8859-7
?          ^
 : Expected iso-8859-7, but got 'ISO-8859-2' in /home/travis/build/erikrose/chardet/tests/iso-8859-7-greek/naftemporiki.gr.wld.xml
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: 'iso-8859-2' != 'utf-8'
- iso-8859-2
+ utf-8
 : Expected utf-8, but got 'ISO-8859-2' in /home/travis/build/erikrose/chardet/tests/utf-8/bom-utf-8.srt
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: 'maccyrillic' != 'iso-8859-6'
- maccyrillic
+ iso-8859-6
 : Expected iso-8859-6, but got 'MacCyrillic' in /home/travis/build/erikrose/chardet/tests/iso-8859-6-arabic/_chromium_ISO-8859-6_with_no_encoding_specified.html
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: 'iso-8859-2' != 'latin1'
- iso-8859-2
+ latin1
 : Expected latin1, but got 'ISO-8859-2' in /home/travis/build/erikrose/chardet/tests/latin1/_ude_2.txt
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: 'tis-620' != 'latin1'
- tis-620
+ latin1
 : Expected latin1, but got 'TIS-620' in /home/travis/build/erikrose/chardet/tests/latin1/_ude_4.txt
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: 'ascii' != 'latin1'
- ascii
+ latin1
 : Expected latin1, but got 'ascii' in /home/travis/build/erikrose/chardet/tests/latin1/_mozilla_bug638318_text.html
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: 'iso-8859-2' != 'latin1'
- iso-8859-2
+ latin1
 : Expected latin1, but got 'ISO-8859-2' in /home/travis/build/erikrose/chardet/tests/latin1/_ude_3.txt
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: 'ibm855' != 'latin1'
- ibm855
+ latin1
 : Expected latin1, but got 'IBM855' in /home/travis/build/erikrose/chardet/tests/latin1/_ude_1.txt
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: 'iso-8859-2' != 'windows-1252'
- iso-8859-2
+ windows-1252
 : Expected windows-1252, but got 'ISO-8859-2' in /home/travis/build/erikrose/chardet/tests/windows-1252/github_bug_9.txt
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: 'iso-8859-2' != 'windows-1252'
- iso-8859-2
+ windows-1252
 : Expected windows-1252, but got 'ISO-8859-2' in /home/travis/build/erikrose/chardet/tests/windows-1252/_mozilla_bug421271_text.html
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: 'ibm855' != 'windows-1250'
- ibm855
+ windows-1250
 : Expected windows-1250, but got 'IBM855' in /home/travis/build/erikrose/chardet/tests/windows-1250-hungarian/bbc.co.uk.hu.pressreview.xml
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: 'iso-8859-2' != 'windows-1250'
- iso-8859-2
+ windows-1250
 : Expected windows-1250, but got 'ISO-8859-2' in /home/travis/build/erikrose/chardet/tests/windows-1250-hungarian/bbc.co.uk.hu.learningenglish.xml
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: 'windows-1255' != 'windows-1250'
- windows-1255
?            ^
+ windows-1250
?            ^
 : Expected windows-1250, but got 'windows-1255' in /home/travis/build/erikrose/chardet/tests/windows-1250-hungarian/bbc.co.uk.hu.xml
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: 'iso-8859-7' != 'windows-1250'
- iso-8859-7
+ windows-1250
 : Expected windows-1250, but got 'ISO-8859-7' in /home/travis/build/erikrose/chardet/tests/windows-1250-hungarian/objektivhir.hu.xml
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: 'iso-8859-2' != 'windows-1250'
- iso-8859-2
+ windows-1250
 : Expected windows-1250, but got 'ISO-8859-2' in /home/travis/build/erikrose/chardet/tests/windows-1250-hungarian/bbc.co.uk.hu.forum.xml
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: 'maccyrillic' != 'windows-1256'
- maccyrillic
+ windows-1256
 : Expected windows-1256, but got 'MacCyrillic' in /home/travis/build/erikrose/chardet/tests/windows-1256-arabic/_chromium_windows-1256_with_no_encoding_specified.html
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: 'windows-1251' != 'iso-8859-2'
- windows-1251
+ iso-8859-2
 : Expected iso-8859-2, but got 'windows-1251' in /home/travis/build/erikrose/chardet/tests/iso-8859-2-hungarian/cigartower.hu.xml
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: 'iso-8859-7' != 'iso-8859-2'
- iso-8859-7
?          ^
+ iso-8859-2
?          ^
 : Expected iso-8859-2, but got 'ISO-8859-7' in /home/travis/build/erikrose/chardet/tests/iso-8859-2-hungarian/escience.hu.xml
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: 'koi8-r' != 'iso-8859-2'
- koi8-r
+ iso-8859-2
 : Expected iso-8859-2, but got 'KOI8-R' in /home/travis/build/erikrose/chardet/tests/iso-8859-2-hungarian/shamalt.uw.hu.xml
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 34, in runTest
    self.file_name))
AssertionError: 'iso-8859-2' != 'windows-1254'
- iso-8859-2
+ windows-1254
 : Expected windows-1254, but got 'ISO-8859-2' in /home/travis/build/erikrose/chardet/tests/windows-1254-turkish/_chromium_windows-1254_with_no_encoding_specified.html
----------------------------------------------------------------------
Ran 384 tests in 109.871s
FAILED (failures=27)

Misdetection of utf-8 as ISO-8859-2 for german umlauts

Version: chardetect-script.py 2.3.0 on Windows environment; Python 3.4.3 (native) and in Cygwin shell (same in other shells)
Here the content of the misdetected script:

#!/usr/bin/env python3
# coding: utf-8
#
################################################################################

__version__ = '1.0'
__author__ = 'ü'

output of chardet:

$ chardetect uu.py 
uu.py: ISO-8859-2 with confidence 0.7916670185186749

output of file (Cygwin):

$ file uu.py 
uu.py: a /usr/bin/env python3 script, UTF-8 Unicode text executable, with CRLF line terminators

If i read the problematic line by open(file, 'rb').readlines() i get
b"__author__ = '\xc3\xbc'\r\n"

Am i getting something wrong?

Changing license

This feels strange to be posing as a question, since I'm one of the co-maintainers, but @sigmavirus24 and @erikrose, do you know if it's okay/legal for us to change the license of chardet? Because it was started by Mark Pilgrim I feel like it's kind of a nebulous question, because he's not someone you can just email, and he has nothing to do with development anymore. I would really like to change the license to at least be MPL, since that's what the C++ version is, and our setup currently mirrors that code pretty closely.

I'm not a fan of the LGPL and feel weird having a project I work on use it.

Is it OK to replace GB2312 by GB18030?

From wiki:
Like UTF-8, GB18030 is a superset of ASCII and can represent the whole range of Unicode code points; in addition, it is also a superset of GB2312.

If GB18030 is a superset of GB2312, is it OK to replace GB2312 by GB18030?

Retraining and storing data

Currently we have a ton of encoding-specific data stored as constants all over the place. This was done to mirror the C code initially, but I propose that we start diverging from the C code in a more substantial way than we have in the past.

The problems I see with the current approach are:

Storing large amounts of data in code makes it much more difficult to read and separate out which files are just data from those that contain actual encoding/prober-specific code.
Retraining the models we have (which are currently based on data from the late 90s) is difficult, because we would have to write a script that generates Python code. Yuck.
It makes the barrier to entry for adding support for new encodings higher than it should be. We should be able to have a tool that takes a bunch of text of a given encoding and generates the tables we need and determines things like the typical "positive ratio" (which is really the ratio of the token frequency of the 512 most common character bigram types to the total number of bigram tokens in a "typical" corpus) automatically. The current layout of the code is very confusing to a new contributor (see point 1).
Because retraining is difficult, chardet is going to get less accurate over time. Speaking as an NLP researcher, I can confidently say that the genre of a text plays a big role in how likely certain character sequences are, and as time goes on the typical web text we see looks less and less like it did when Mozilla collected their original data. Also, our accuracy for text that isn't from webpages is probably not that great.

So if we're agreement that the current approach is bad, how do we want to fix it?

I propose that we:

Store the data in either JSON or YAML formats in the GitHub repository. This would potentially allow us to share our data with chardet ports written in other languages (if they wanted to support our format).
As part of the setup.py install process, convert the files to pickled dictionaries.
Modify the prober initializers to take a path to either a pickled dictionary or a JSON/YAML file and load up that data at run-time. Supporting both types of file would simplify development, since we could play around with models without having to constantly convert them to pickles.
Modify chardet.detect to cache its UniversalDetector object so that we don't constantly create new prober objects and reload the pickles.

The only problem I see with this approach is that it will slow down import chardet, but loading pickles is usually pretty fast.

@sigmavirus24, what do you think?

ord() expected a character, but string of length 278 found

  File "/root/w3af/w3af/core/data/misc/encoding.py", line 89, in smart_unicode
    guessed_encoding = chardet.detect(s)['encoding']
  File "/usr/local/lib/python2.7/dist-packages/chardet/__init__.py", line 24, in detect
    u.feed(aBuf)
  File "/usr/local/lib/python2.7/dist-packages/chardet/universaldetector.py", line 119, in feed
    if prober.feed(aBuf) == constants.eFoundIt:
  File "/usr/local/lib/python2.7/dist-packages/chardet/charsetgroupprober.py", line 59, in feed
    st = prober.feed(aBuf)
  File "/usr/local/lib/python2.7/dist-packages/chardet/utf8prober.py", line 52, in feed
    codingState = self._mCodingSM.next_state(c)
  File "/usr/local/lib/python2.7/dist-packages/chardet/codingstatemachine.py", line 44, in next_state
    byteCls = self._mModel['classTable'][ord(c)]

Original bug report at the w3af project which uses chardet==2.1.1

GB18030 for Chinese

I just check the source code from upstream. They commented clearly that We use gb18030 to replace gb2312, because 18030 is a superset. If the upstream already corrected this problem, there is no reason to hold the problem here. Please reopen and merge 33. It's funny to write the following code.

if encoding == 'GB2312':
    encoding = 'GB18030'

IndexError: tuple index out of range

The following string raises the titular exception when run chardet.detect is run on it using 9e419e9

b'\xfe\xcf'

Here's the full stack trace:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/david/crap/chardet/chardet/__init__.py", line 30, in detect
    u.feed(byte_str)
  File "/home/david/crap/chardet/chardet/universaldetector.py", line 189, in feed
    if prober.feed(byte_str) == ProbingState.found_it:
  File "/home/david/crap/chardet/chardet/charsetgroupprober.py", line 63, in feed
    state = prober.feed(byte_str)
  File "/home/david/crap/chardet/chardet/mbcharsetprober.py", line 75, in feed
    char_len)
  File "/home/david/crap/chardet/chardet/chardistribution.py", line 82, in feed
    if 512 > self._char_to_freq_order[order]:
IndexError: tuple index out of range

This one doesn't actually come from Hypothesis but from a fuzzing experiment I was running which it occurred to me would be applicable to chardet.

iso-8859-7 encoding of non-breaking space prevents encoding detection

The following string is detected as having a None encoding despite being a valid string in one of chardet's supported encodings:

u'<\xa0'.encode('iso-8859-7')

This remains true even if you pad it with ascii, it's not a length issue.

This behaviour is present in 9e419e9 (I just only was able to find it once #63 was fixed).

Shall I send you a pull request with the test that is finding these? It's not very complicated.

Incorrectly detecting valid UTF-16 as UTF-32LE, for which it is invalid

Given the following string:

u'\x000'.encode('utf-16')

chardet.detect as of 2.3.0 reports this as 'UTF-32LE' with a confidence of 1.0, but attempting to decode it as such fails with

UnicodeDecodeError: 'utf32' codec can't decode bytes in position 4-5: truncated data

I found this bug using Hypothesis. I'd be happy to submit a pull request adding the test that found it if you'd like me to, though it is of course currently failing.

Merge with cChardet?

I only recently discovered that there's a substantially faster version of chardet, cChardet, that's just a Cython wrapper around uchardet-enhanced.

According to their benchmarks it's about 2800 times faster, so if we're only doing the same things they are, maybe we should recommend people who are using CPython use that.

multiple detection failures with 2.2.1

Using d5d0812, I am seeing multiple failues to correctly detect utf-8. Some examples: "gebührenfrei", "exámple", "naïve", "sie hören", "This is a cat 😸". (Strings that are ASCII except for a single character seem to be particularly troublesome.)

See also sv24-archive/charade#24, sv24-archive/charade#25. d5d0812 seems to be doing slightly "better" than python-chardet-2.0.1-7.fc20.noarch, for what it's worth (fewer confidence = 0.99 detections, though still wrong).

I also see 27 failed unit tests. Please let me know if this is known and/or if I should paste the complete error log here.

ISO-8859-2 file not recognised by new version but properly recognised by older version.

Hello ,

I have a situation where I'm trying to detect encoding of ISO-8859-2 file.

In [1]: import chardet

In [2]: chardet.__version__
Out[2]: '2.2.1'

In [3]: chardet.detect(file('iso_file.csv', mode='rb').read())
Out[3]: {'confidence': 0.8727101643152726, 'encoding': 'ISO-8859-2'}

As you can see it's properly detected.

But after pip install -U chardet

In [16]: import chardet

In [17]: chardet.__version__
Out[17]: '2.3.0'

In [18]: chardet.detect(file('iso_file.csv', mode='rb').read())
Out[18]: {'confidence': 1.0, 'encoding': 'UTF-8-SIG'}

Can you provide some details around what was changed in new version that would trigger incorrect behaviour and what I should do from my side to help library better recognize encoding?

Turkish (iso-8859-9) detecting.

I want to help for Turkish language detection but I don't understand what order or how I supposed to put things to lang*model.py.

Turkish is Latin5 8-bit single-byte coded Language
http://en.wikipedia.org/wiki/ISO-8859-9
http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-128.pdf (code table is on page 16)

Reformat docs into Sphinx format

Since this project was created, Sphinx has become the de-facto standard for publishing [Python] documentation on the web.

Thus, should reformat docs into Sphinx format

#23

ISO-8859-1 being detect as Windows-1252

Sorry if there is already an issue related to this, but I'm not sure what is cause yet.

I've got an ISO-8859-1 string detected as Windows-1252, although those two encoding differ in only 32 characters, mystring.decode('windows-1252') fail to decode the content, that's why I'm filing this issue.

If you need a test case you can query whois data for brasil.gov.br the string will contain ISO-8859-1 encoded data however will be detect as Windows-1252 with 99% confidence.

Old error in the SBCharSetProber.cpp (or .py) of the Universal Charset Detector

Hello,

when I was working on my new language models for Central European countries I found old error in the sbcharsetprober.py (or .cpp) file.

I've looked around on the internet and I've found only ONE developer/contributor (PyYoshi): who corrected this amazing error (There are some other bugs are corrected and added many new language models).

In the code of all forks I've found (python, cpp, ...) "-1" is missing (The part of source code below is already corrected):

// Order is in [1-64] but we want 0-63 here.
order = mModel->charToOrderMap[(unsigned char)aBuf[i]] -1

if (order < SYMBOL_CAT_ORDER)
  mTotalChar++;
if (order < SAMPLE_SIZE)
  {

I've spent half a day to understand why my new lang models give very low
confidence value for tested text. With adding "-1" confident values are
normal.

If you can, please, post this info to other chardet developers.

Many thanks.

Certain input creates extremely long runtime and memory leak

I am using chardet as part of a web crawler written in python3. I noticed that over time (many hours), the program consumes all memory. I narrowed down the problem to a single call of chardet.detect() method for certain web pages.

After some testing, it seems that chardet has problem with some special input and I managed to get a sample of such an input. It consumes on my machine about 220 MB of memory (however, the input is 2.5 MB) and takes about 1:22 minutes to process (in contrast to 43 ms when the file is truncated to about 2 MB). It seems not to be limited to python3, in python2 the memory consumption is even worse (312 MB).

Versions:

Fedora release 20 (Heisenbug) x86_64
chardet-2.2.1 (via pip)
python3-3.3.2-11.fc20.x86_64
python-2.7.5-11.fc20.x86_64

How to reproduce:

I cannot attach any files to this issue so I uploaded them to my dropbox account: https://www.dropbox.com/sh/26dry8zj18cv0m1/sKgP_E44qx/chardet_test.zip Please let me know of a better place where to put it if necessary. Here is an overview of the content and the results:

setup='import chardet; html = open("mem_leak_html.txt", "rb").read()'
python3 -m timeit -s "$setup"  'chardet.detect(html[:2543482])'
# produces: 10 loops, best of 3: 43 ms per loop
python3 -m timeit -s "$setup"  'chardet.detect(html[:2543483])'
# produces: 1 loops, best of 3: 1min 22s per loop
python3 mem_leak_test.py
# produces:
# Good input left 2.65 MB of unfreed memory.
# Bad input left 220.16 MB of unfreed memory.

python -m timeit -s "$setup"  'chardet.detect(html[:2543482])'
# produces: 10 loops, best of 3: 41.7 ms per loop
python -m timeit -s "$setup"  'chardet.detect(html[:2543483])'
# produces: 10 loops, best of 3: 111 sec per loop
python mem_leak_test.py
# produces:
# Good input left 3.00 MB of unfreed memory.
# Bad input left 312.00 MB of unfreed memory.

mem_leak_test.py:

import resource
import chardet
import gc

mem_use = lambda: resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
html = open("mem_leak_html.txt", "rb").read()

def test(desc, instr):
    gc.collect()
    mem_start = mem_use()
    chardet.detect(instr)    
    gc.collect()
    mem_used = mem_use() - mem_start
    print('%s left %.2f MB of unfreed memory.' % (desc, mem_used))    

test('Good input', html[:2543482])
test('Bad input', html[:2543483])

Line Feed

Hi Guys,
Does this return information about the line feed of a file e.g.
LF+CR
LF
CR
CR+LF

failed to detect utf8 "FAHR•WERK"

I got the original codec from url "FAHR%E2%80%A2WERK", it's "FAHR•WERK" in utf-8.
But when I use chardet.detect, the result is {'confidence': 0.73, 'encoding': 'windows-1252'}
and the confidence of the utf-8 is only 0.505.
And I think there's sth wrong with utf8prober. So I look into utf8prober.py, and find code below:

elif coding_state == MachineState.start:
    if self.coding_sm.get_current_charlen() >= 2:
        self._num_mb_chars += 1

It seems only multibyte character can be judged as an utf-8 character.
So codec like "FAHR%E2%80%A2WERK" get a very low confidence.

In this case , I think we should judge the the single byte character also as an utf-8 character.
So I change the code into:

elif coding_state == MachineState.start:
    if self.coding_sm.get_current_charlen() >= 1:
        self._num_mb_chars += 1

and the result is {'confidence': 0.99, 'encoding': 'utf-8'}

No support for Arabic language (cp1256)

I just found a test for Arabic.
https://github.com/chardet/chardet/blob/master/tests/windows-1256-arabic/_chromium_windows-1256_with_no_encoding_specified.html

According to this link, cp1256 is also an important encoding.
http://w3techs.com/technologies/overview/character_encoding/all

LGPL and iPhone

Had an "interesting" discussion in the Stackoverflow Python room (bookmarked transcript here)

There's an article compatibility between the iphone and lgpl that has some analysis as well as links to other resources, but it appears there's not really a conclusion except the owner could assert rights, but in the spirit of things most likely wouldn't...

I'm just wondering what the stance is from the author(s) here?

Please do not remove old versions from pypi

Hi,

I notice that older versions (pre-2.3.0) have been removed from https://pypi.python.org/pypi/chardet

Removing older versions will break people's deployments. Could you please keep this in mind when making new releases?

Thanks in advance,

Kees

Can not distinguish UTF-8 and UTF-8-SIG

I tried to change the universaldetector.py file, but it still returned utf-8.

            if aBuf[:3] == codecs.BOM:
                # EF BB BF  UTF-8 with BOM
                self.result = {'encoding': "UTF-8-SIG", 'confidence': 1.0}

Output:

>>> chardet.detect(open('test.txt').read())
{'confidence': 0.99, 'encoding': 'utf-8'}

UTF-8 + win EOL + 1st empty string = error

I have a file in UTF-8 with BOM with Dos\Windows lines end and first string empty:

EF BB BF 0D 0A 23 69 6E .......(text)

info = chardet.detect(raw)

info {'confidence': 0.7821182921733318, 'encoding': 'ISO-8859-2'}

There is error with Unix line end (\n) also :(

It's needed to say, that I read file this way:
file_open = open(file_path, "r") # rb also error
raw = file_open.read()

Python 2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)] on win32

Misdetects iso8859-1 as windows-1251 (cyrillic)

>>> chardet.detect('"ULTIMA ATUALIZACAO";"17/03/2014 04:01"\r\n"ANO";"MES";"SENADOR";"TIPO_DESPESA";"CNPJ_CPF";"FORNECEDOR";"DOCUMENTO";"DATA";"DETALHAMENTO";"VALOR_REEMBOLSADO"\r\n"2011";"1";"ACIR GURGACZ";"Aluguel de im\xf3veis para escrit\xf3rio pol\xedtico, compreendendo despesas concernentes a eles.";"05.914.650/0001-66";"CERON - CENTRAIS EL\xc9TRICAS DE ROND\xd4NIA S.A.";"45216633";"11/01/11";"";"47,65"\r\n"2011";"1";"ACIR GURGACZ";"Aluguel de im\xf3veis para escrit\xf3rio pol\xedtico, compreendendo despesas concernentes a eles.";"05.914.650/0001-66";"CERON - CENTRAIS EL\xc9TRICAS DE ROND\xd4NIA S.A.";"4542061";"18/01/11";"";"196,67"\r\n"2011";"1";"ACIR GURGACZ";"Aluguel de im\xf3veis para escrit\xf3rio pol\xedtico, compreendendo despesas concernentes a eles.";"004.948.028-63";"GILBERTO PISELO DO NASCIMENTO";"01";"12/01/11";"";"5000"\r\n"2011";"1";"ACIR GURGACZ";"Aluguel de im\xf3veis para escrit\xf3rio pol\xedtico, compreendendo despesas concernentes a eles.";"76.535.764/0001-43";"OI BRASIL TELECOM S.A.";"963011";"14/01/11";"";"480,59"\r\n"2011";"1";"ACIR GURGACZ";"Aquisi\xe7\xe3o de ma')
{'confidence': 0.99, 'encoding': 'windows-1251'}

Escape control character also prevents detection in utf-8

The following has no encoding detected:

u'\u020d\x1b'.encode('utf-8')

Again, not a length issue. Padding with ascii doesn't change anything. The initial character is needed, I believe purely because it prevents the fix in #63 from applying by preventing the string from being considered as ascii.

Behaviour is present in 9e419e9

Incorrect Detection of HZ-GB-2312 with ASCII Text

Recently had an issue with chardet usage in the in the requests module where it was incorrectly detecting the encoding on a JSON Blob. I discovered the response.apparent_encoding and that chardet is used to set it.

I was able to identify what in my data was causing the wrong detection and distill it down to the occurrence of these simple strings.

$ cat test_chardet 
~{,
~},

$ file test_chardet
test_chardetect: ASCII text

$ chardetect test_chardet
test_chardetect: HZ-GB-2312 with confidence 0.99

Originally it was a JSON Blog of all ASCII characters encoded as UTF-8. As a work around I have set up the service to specify the following in the header -

{'Content-Type': 'application/json; charset=utf-8'}

Which will have requests set the encoding properly.

I can add ASCII characters to the file, white space, quotes and numbers. It will still detect as HZ-GB-2312.

text encoding detected as windows-1255 (hebrew) while being windows-1252

With a certain file, charted detects encoding as windows-1255. I would have thought it might be the file itself but my IDE for python development detected the encoding properly.
File link:
http://simple.podnapisi.net/fr/ppodnapisi/download/i/2292027/k/b5a3b7904326647fce0270ef3f441de7b73663af

filter_with_english_letters

Currently, this function has been listed as a TO-DO. I was looking over at the source from Mozilla and it seems that there could be a bug in that.

From what I can tell, the original intention of this function was to remove all markup tags. Its used in the LatinProber and I imagine that the idea is to remove all markup tags - which will probably contain english alphabets/words - so that we can avoid skewing our confidence incorrectly.

The current behavior though is not that. A simple example:

<some tag> outside <some tag>

returns

tag outside tag

It includes parts of text within a tag if there are multiple words separated by any kind of punctuation in the tag. I can look into this, by I wanted to know your thoughts on this first?

Failing to guess a single MS-apostrophe

I have a page of text in ASCII with a single Microsoft-apostrophe chr(8217) detected as ISO-8859-2.

#1. Create problematic sample
>>> s = 'today' + chr(8217) + 's research'
>>> s
'today’s research'
>>> b = s.encode('windows-1252')
>>> b
b'today\x92s research'

#2. Attempt to decode it
>>> chardet.detect(b)
{'encoding': 'ISO-8859-2', 'confidence': 0.8060609643099236}
>>> b.decode('ISO-8859-2')
'today\x92s research'

#3. Now try the correct encoding
>>> b.decode('windows-1252')
'today’s research'

This text is very typical of anything created using a Microsoft editor. Furthermore, latest version of Firefox detects it correctly. I am using Python 3.3. Any help is appreciated.

Inconsistent behavior on small strings

UPD: Deleted python2.7 example because it was not working properly. See a comment below for a better test case.

This is all on Debian GNU/Linux unstable with the current master:

$ python3.4 -c "import chardet; print(chardet.detect(u'é'.encode('utf-8')))"
{'confidence': 0.73, 'encoding': 'windows-1252'}
$ python3.4 -c "import chardet; print(chardet.detect(u'éé'.encode('utf-8')))"
{'confidence': 0.7525, 'encoding': 'utf-8'}

The second line should be utf-8 as well, not windows-1252.

New release

Latest release is very old - 2014-10-07. There are some useful updates since this date.

ASCII escape control character prevents detection of encoding

It appears to be the case that the presence of '\x1b' character (the escape key) in an ascii string prevents it from being recognised as ascii. e.g. for any n the following returns an encoding of None:

b'0' * n + b'\x1b'

This behaviour is present in master at ff1d917

windows-1255 instead of windows-1252 but windows-1255 trigger UnicodeError

Here's a demonstration of an example where the detection return a charset that triggers an unicode error

text =b'xxxx), xx xxxxxx xx xxxxxx xx\xe9\xe9x \xe0 xxxxx x xxxx \nxxxxx\xe9 xx xx xxx xxxxxxx\xe9.\n\n*__*\n\nxx *xxx, x. xxxxx*, xxxxx xxxxxxx xx xxxxxx xxxx x\xe9xxxxxx \xe0 xxxxxxxx \nxxxx xxx xxxxxxxxx xxxxxx\xe9xx xxx xxx xxxxxxxxxxxxxxx xxxx xx xx xx xx \nxxxxxxx xx\xe9x\xe9xxxxx :\n\n- xx xxxxxx x\\xxxxxxxxxxxx xxxxxxxxxx \xe0 x\\xxxxxxxxxxxxxx\xe9 xx xxxx xx xxxx : xxx \nxx ; xxx (xxxx) / xxx (xxxx) xxxxxx.\n\n- xxx x\xe9xxxxxx xxx xxxxxxxxxxxx xxxx xxxx : xx xxxx\xe9xxxxxxx xxxxxxxxx \n(xx xxx xxx \\xxxxx) / xxxxxxxxx (xx xxx xxx \\xxxxx) xxx xxxxxxx x\\xxxxxxxx xxxxxxxxx\xe9 \nxx xxx.\n\n- xxx x\xe9xxxxxx xxx xxxxxx xx xxxxxxxxx xxxx xxxx : xx xxxx\xe9xxxxx xxxxx \nxxx x\xe9xxxxxx xxxx xx xxxxxxx (xx xxx xxx \\xxxxx) xx xxxxxx xxxx xx xxxxxx (xx \nxxx xxx \\xxxxx) x\\xxxxxxxxxxxxx xxx xx xx\xfbx xxxx xxxxxxxxx xxx xxxxxx (xxxxxxxxx \nxxxx xxxxxx, xxxxx xx xxxxxxxxx xxxx xxxxxxxxx ...) xx x\\xxxxxxxxxxx xx \nxxxxxxxxxx xxx\xe9xxxxxxx xxxx xxxxxxxx xxxxxxx.\n\n- xxx xxxxxx x\\xxxxxxxxxxxxxxxx : xx xxxx\xe9xxxxx xx xxxxxxxx xx\xe9xxxx xxxxx \nxxxx xx xxxx xxxxxxx\xe9x xx xxxxxxx xxxxx\xe8'

chardet.detect(text) # => windows-1255
chardet.decode('windows-1255')
# => UnicodeDecodeError: 'charmap' codec can't decode byte 0xfb in position 724: character maps to <undefined>

chardet.detect(text) # => windows-1252
chardet.decode('windows-1252') # => works

Something nice would be:

to return multiple charsets (sorted by prober confidence) so I can try decoding each myself
- this is a PR I can do
scan the whole string to detect invalid characters for the charset

Meaningwhile, here's my solution:

try:
    return part.decode(charset)
except UnicodeDecodeError:
    detector = UniversalDetector()
    detector.feed(part)
    detector.close()
    try:
        return part.decode(detector.result['encoding'])
    except UnicodeDecodeError as e:
        for prober in detector._charset_probers:
            if prober.get_confidence() > detector.MINIMUM_THRESHOLD:
                try:
                    return part.decode(prober.charset_name)
                except UnicodeDecodeError:
                    pass
        raise e

Detection of windows-1250

Hi!
Is it possible to do a windows-1250 detection? Current implementation returns "windows-1252" for text encoded in windows-1250. Same question goes for "ISO-8859-2" vs. "ISO-8859-1".

Host the docs on the web

The docs from https://github.com/chardet/chardet/tree/master/docs should be hosted on a web site, to make them more accessible.

https://readthedocs.org/ maybe?

Potential changes to filter_without_english_characters?

I noticed that the filter_without_english_characters function in chardet simply replaces any english alphabetical character with a space character. This might lead to inaccuracies in our confidence. I tried mimicking the behavior of Mozilla's implementation more closely and this led to a decrease in the count of failing unit tests from 28 to 25.

I wanted to know your thoughts on this and you can also check out my changes over here. .

Chardet returns ISO-8859 when cp1252 is better

Ages ago, I filed a bug that got erroneously closed by a commit. I just stumbled on it again today, so I'm moving it over from the old location so we can see about getting it going again.

Here's the text of the old bug:

I had a frustrating issue recently when trying to use chardet to work with a web page: http://stackoverflow.com/questions/11588458/how-to-handle-encodings-using-python-requests-library

My solution was to write a bit of custom code that says, "Whenever chardet reports ISO-8859-1, instead use cp1252." It would be great if chardet internalized this behavior.

Basically, browsers don't use a number of character encodings, and instead map to other ones instead. Since most of the data that chardet is used on will be coming from the web, it makes sense for it to return the character encodings that are used by browsers.

This might make sense as an option rather than default functionality....not sure, but I'd love to see this be added.

Is there still an appetite for this kind of an issue? Basically, I think—and ages ago, committer @dan-blanchard agreed—that chardet should never return ISO-8859 and should always return cp1252 instead.

Upgrading from oldish version

Hi guys,

I recently took over a project that make use of chardet @ 73ab963 which I guess is 1.1.

Do you have any pointer on big breaking changes since then I should be aware of?

Thx !

def detect(byte_str):
    if (PY2 and isinstance(byte_str, unicode)) or (PY3 and
                                               not isinstance(byte_str,
                                                              bytes)):
        raise ValueError('Expected a bytes object, not a unicode object')

Can we perhaps add some way of detecting if we're using IronPython and then a check in init, which would pass bytes(string) to u.feed() instead?

For instance, in compat.py:

import platform

if (platform.python_implementation() == 'IronPython'):
    IPY = True
else:
    IPY = False

in init.py:

from compat import IPY
if (PY2 and not IPY and isinstance(byte_str, unicode)) or (PY3 or IPY and not isinstance(byte_str, bytes)):
    raise ValueError('Expected a bytes object, not a unicode object')