Comments (13)
- Don't try to detect encoding from two words. You need at least 256 characters.
- The Hungarian (iso-8859-2) prober and its language model are poor and lead to wrong results.
from chardet.
The text in question is 3240 characters long; all characters are ASCII except for the ill-fated apostrophe. The sample above is sufficient to replicate the issue.
I run it with chardet.constants._debug=1 and see that windows-1252 is not even attempted. Documentation says that windows-1252 is only attempted as the last resort. This is very wrong in my case since I am in North America. I guess the best way to proceed is to override a few methods...
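For context on why that one byte matters, here is a minimal sketch using only Python's standard codecs (chardet is not involved). It shows how the apostrophe byte decodes under the candidate encodings. The sample string is the short one used for reproduction in this thread; the variable names are illustrative only.

```python
# Encode the sample with a typographic apostrophe (U+2019) as windows-1252.
sample = "today\u2019s research".encode("windows-1252")

# Collect the bytes outside the ASCII range; there is exactly one here.
non_ascii = [byte for byte in sample if byte >= 0x80]

# Decode that byte under each candidate single-byte encoding.
decoded = {
    name: bytes(non_ascii).decode(name)
    for name in ("windows-1252", "iso-8859-1", "iso-8859-2")
}

for name, char in decoded.items():
    print(f"{name:>12}: U+{ord(char):04X}")
```

Only windows-1252 maps byte 0x92 to a printable character (U+2019). In the ISO-8859 codecs it lands in the C1 control range, which is a strong hint that the text is not ISO-8859-2.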
from chardet.
@shompol I've been a bit behind on things lately since this is just a side project for me and I have a newborn son, but I plan to make a release "soon" (probably in the next month) that either disables the Hungarian prober (as is currently the case in master), or switches to the retrained version in #52. I would suggest giving the version in master a try and seeing if you still see the same issue.
from chardet.
Also, Firefox detects things correctly because they disabled their Hungarian prober a while ago. I didn't realize that until a month or two ago.
from chardet.
Thank you Dan, the master version works much better:
>>> import chardet
>>> s = 'today' + chr(8217) + 's research'
>>> b = s.encode('windows-1252')
>>> chardet.detect(b)
{'encoding': 'windows-1252', 'confidence': 0.73}
The original 3KB text detection works as well.
May I inquire why the Latin-1 confidence is artificially reduced by 0.73? It seems a little odd to me, especially since Latin-1 covers 99% of all texts I work with in the US.
Congratulations on your son! :)
from chardet.
@dan-blanchard feel free to assign me a bug for a new release.
from chardet.
Good news:
I'm working on a new (mixed = more than one language) language model for the latin1 (cp1252) prober. It replaces the old one, which is based on a very "strange heuristic/statistic" (strange to me, at least). The new model will be based on statistical probability, like the other "normal" single-byte probers.
Bad news:
Some languages in the "latin1 group" give me bad bigram correlations with other languages, and I need to decide which ones will be excluded from this group. For these languages I want to create separate language models, as in the latin2 group.
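To illustrate what a bigram-based language model compares, here is a rough sketch. It is not chardet's actual model or training code. The function names, the tiny "training" strings, and the overlap score are all assumptions made for illustration; a real model would be trained on a large corpus and use frequency classes.

```python
from collections import Counter


def letter_bigrams(text):
    """Count adjacent letter pairs, skipping digits, spaces, and symbols."""
    counts = Counter()
    previous = None
    for char in text.lower():
        if char.isalpha():
            if previous is not None:
                counts[previous + char] += 1
            previous = char
        else:
            previous = None  # a non-letter breaks the pair
    return counts


def overlap(sample, reference):
    """Fraction of the sample's bigrams that also occur in the reference model."""
    total = sum(sample.values())
    if total == 0:
        return 0.0
    shared = sum(count for pair, count in sample.items() if pair in reference)
    return shared / total


# Tiny hypothetical "training" texts, far too small for real use.
english_model = letter_bigrams("the research is in the other theatre")
german_model = letter_bigrams("die forschung ist schon in der schule")

sample = letter_bigrams("today's research")
print(overlap(sample, english_model), overlap(sample, german_model))
```

When two languages share most of their frequent bigrams, their scores end up close together, which is presumably the kind of correlation problem described above.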
from chardet.
Just a note that apparently chardet is also used by BeautifulSoup, and it got fixed too:
>>> s = 'today' + chr(8217) + 's research'
>>> b = s.encode('windows-1252')
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(b)
>>> soup
<html><head></head><body>today’s research</body></html>
>>> soup.original_encoding
'windows-1252'
- BeautifulSoup implicitly uses an HTML parser; in my case it is html5lib
from chardet.
"Just a note that apparently chardet is also used by BeautifulSoup, and it got fixed too"
BeautifulSoup uses cChardet if it's available, and uses chardet as a fallback. Unfortunately, cChardet and chardet have very different detection characteristics these days. We can handle more encodings than cChardet, and for some of them we're more accurate, but they're much faster (and already had the Hungarian detector disabled).
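The "use cChardet if available, otherwise chardet" behaviour can be sketched as a plain import fallback. This is not Beautiful Soup's actual code. `load_detector` is a hypothetical helper, and the sketch assumes both packages expose a `detect(bytes)` function under the module names `cchardet` and `chardet`.

```python
def load_detector():
    """Return (name, detect_function) for the first available library.

    Returns (None, None) when neither package is installed.
    """
    try:
        import cchardet  # C-based detector; faster when installed
        return "cchardet", cchardet.detect
    except ImportError:
        pass
    try:
        import chardet  # pure-Python fallback
        return "chardet", chardet.detect
    except ImportError:
        return None, None


name, detect = load_detector()
if detect is not None:
    print(name, detect("today\u2019s research".encode("windows-1252")))
else:
    print("no detector installed")
```

Because the two libraries can disagree, printing which one was picked makes it easier to explain differing results between machines.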
from chardet.
@shompol: May I inquire why the Latin-1 confidence is artificially reduced by 0.73?
The latin1 prober is based on a "heuristic table" of letter types/categories (small ASCII, capital ASCII, small ASCII with acute, ...) and is not a normal/common language model. Values >0.73 make the latin1 prober win over the other common language models, which can lead to many wrong detections. There are many languages in the Latin1 group, and the better solution is to create one "heuristic table" instead of many common language models.
The main idea of the universal charset detector (for single-byte charsets) is to search for the language closest to the given text (ignoring digits and symbols such as apostrophes or quotation marks).
Please look first here and then here.
BTW, I have increased the value from 0.73 to 0.80 in my new latin1 prober.
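A small sketch of how such a scaling factor changes which prober wins. It assumes the latin1 score is multiplied by the factor before the highest confidence is picked. The raw scores and the `pick_winner` helper are hypothetical and do not come from chardet.

```python
LATIN1_SCALE = 0.73  # the factor discussed in this thread


def pick_winner(raw_confidences, scale=LATIN1_SCALE):
    """Scale the latin1 score, then return the highest-scoring prober."""
    adjusted = dict(raw_confidences)
    if "windows-1252" in adjusted:
        adjusted["windows-1252"] *= scale
    winner = max(adjusted, key=adjusted.get)
    return winner, adjusted[winner]


# Hypothetical raw scores, not real chardet output.
scores = {"windows-1252": 1.0, "ISO-8859-2": 0.75}

print(pick_winner(scores))              # latin1 scaled to 0.73, so it loses
print(pick_winner(scores, scale=0.80))  # latin1 scaled to 0.80, so it wins
```

With these made-up numbers, a competing prober that reports 0.75 beats latin1 at a factor of 0.73 but not at 0.80, so the choice of factor directly shifts the balance between the heuristic table and the statistical models.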
from chardet.
I reached my own "milestone" (0 failures) once I found a solution for the failure with Greek Ά:
Ran 416 tests in 62.030s
OK
All changes are in my fork. You can clone it and try nosetests.
from chardet.
@helour That's fantastic news! I'm sorry I've been slow on the uptake of this, but I'll definitely check this out soon. Either this weekend or next week.
from chardet.
Fixed in 3.0.
from chardet.