I have a page of text in ASCII with a single Microsoft-apostrophe ...

Failing to guess a single MS-apostrophe about chardet HOT 13 CLOSED

chardet commented on June 12, 2024

Failing to guess a single MS-apostrophe

from chardet.

Comments (13)

commented on June 12, 2024

Don't try to detect encoding from two words. You need at least 256 characters.
The Hungarian (ISO-8859-2) prober's language model is bad, and it leads to wrong results.

shompol commented on June 12, 2024

The text in question is 3240 characters long, all characters are ASCII except for the ill-fated apostrophe. The sample above is sufficient to replicate the issue.

I ran it with chardet.constants._debug=1 and saw that windows-1252 is not even attempted. The documentation says that windows-1252 is only attempted as a last resort. This is very wrong in my case, since I am in North America. I guess the best way to proceed is to override a few methods...
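
One way to sidestep the prober order without patching chardet is to try strict decodes first and fall back to windows-1252 only when they fail. This is a minimal sketch, assuming the input is mostly ASCII or UTF-8 with the occasional windows-1252 byte. The helper name decode_with_fallback is hypothetical and not part of chardet:

```python
def decode_with_fallback(data, encodings=("ascii", "utf-8", "windows-1252")):
    """Return (text, encoding) using the first encoding that decodes strictly."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Assumption: as a last resort, map every byte to a code point (latin-1
    # never fails) rather than raise.
    return data.decode("latin-1"), "latin-1"


sample = "today\u2019s research".encode("windows-1252")
text, used = decode_with_fallback(sample)
print(used, repr(text))
```

The order of the candidate encodings is the whole design here: stricter encodings go first, so a permissive one such as windows-1252 is only reached when the others reject the bytes.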

dan-blanchard commented on June 12, 2024

@shompol I've been a bit behind on things lately since this is just a side project for me and I have a newborn son, but I plan to make a release "soon" (probably in the next month) that either disables the Hungarian prober (as is currently the case in master), or switches to the retrained version in #52. I would suggest giving the version in master a try and seeing if you still see the same issue.

dan-blanchard commented on June 12, 2024

Also, Firefox detects things correctly because they disabled their Hungarian prober a while ago. I didn't realize that until a month or two ago.

shompol commented on June 12, 2024

Thank you Dan, the master version works much better:

>>> import chardet
>>> s = 'today' + chr(8217) + 's research'
>>> b = s.encode('windows-1252')
>>> chardet.detect(b)
{'encoding': 'windows-1252', 'confidence': 0.73}

The original 3KB text detection works as well.
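
For context, the reproduction hinges on a single byte. A small sketch using only the standard library shows what the "Microsoft apostrophe" (U+2019) becomes in windows-1252, and why strict ISO-8859-1 cannot represent it:

```python
apostrophe = chr(8217)  # U+2019 RIGHT SINGLE QUOTATION MARK
encoded = apostrophe.encode("windows-1252")
print(encoded)  # a single byte, 0x92

# ISO-8859-1 has no slot for U+2019, so strict encoding fails there.
try:
    apostrophe.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# Decoding 0x92 as ISO-8859-1 yields a C1 control character, not an apostrophe.
as_latin1 = encoded.decode("iso-8859-1")
print(latin1_ok, repr(as_latin1))
```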

May I inquire why the Latin-1 confidence is artificially reduced by 0.73? It seems a little odd to me, especially since Latin-1 covers 99% of all texts I work with in the US.

Congratulations on your son! :)

sigmavirus24 commented on June 12, 2024

@dan-blanchard feel free to assign me a bug for a new release.

commented on June 12, 2024

Good news:
I'm working on a new mixed (more than one language) model for the latin1 (cp1252) prober. It replaces the old one, which is based on a very "strange heuristic/statistic" (especially to me). The new model will be based on statistical probability, like the other "normal" single-byte probers.
Bad news:
Some languages in the "latin1 group" give me bad bigram correlations with other languages, and I need to decide which ones to exclude from this group. For those languages I want to create separate language models, as in the latin2 group.

shompol commented on June 12, 2024

Just a note that apparently chardet is also used by BeautifulSoup, and it got fixed too:

>>> s = 'today' + chr(8217) + 's research'
>>> b = s.encode('windows-1252')
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(b)
>>> soup
<html><head></head><body>today’s research</body></html>
>>> soup.original_encoding
'windows-1252'

- BeautifulSoup implicitly uses an HTML parser; in my case it is html5lib.

dan-blanchard commented on June 12, 2024

Just a note that apparently chardet is also used by BeautifulSoup, and it got fixed too

BeautifulSoup uses cChardet if it's available, and uses chardet as a fallback. Unfortunately, cChardet and chardet have very different detection characteristics these days. We can handle more encodings than cChardet, and for some of them we're more accurate, but they're much faster (and already had the Hungarian detector disabled).
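
The "use cChardet if available, otherwise chardet" pattern can be sketched as below. This is an illustration of the fallback idea, not BeautifulSoup's actual code. The load_detector helper is hypothetical, and the sketch assumes both packages expose a module-level detect() function:

```python
import importlib


def load_detector(candidates=("cchardet", "chardet")):
    """Return the first importable detector module, or None if none is installed."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


detector = load_detector()
if detector is not None:
    # Which module answers depends on what is installed, and so may the result.
    print(detector.__name__, detector.detect(b"today\x92s research"))
else:
    print("no detector installed")
```

Because the two libraries can disagree, it may be worth logging which module was actually picked up when comparing results across machines.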

commented on June 12, 2024

@shompol: May I inquire why the Latin-1 confidence is artificially reduced by 0.73?
The latin1 prober is based on a "heuristic table" of letter types/categories (small ASCII, capital ASCII, small ASCII with acute, ...) and is not a normal/common language model. Values above 0.73 cause the latin1 prober to win over the other common language models, which can lead to many wrong detections. There are many languages in the Latin1 group, and a better solution is to create a "heuristic table" instead of many common language models.
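
To illustrate the point about 0.73, here is a toy sketch of how scaling one prober's confidence changes which prober wins. It assumes the factor is applied as a multiplier, which this thread does not spell out. The prober names, the raw scores, and the pick_winner helper are all hypothetical, and this is not chardet's actual code:

```python
def pick_winner(raw_scores, latin1_scale):
    """Scale the latin1 score, then return the (name, confidence) pair with the highest value."""
    scaled = {
        name: score * latin1_scale if name == "latin1" else score
        for name, score in raw_scores.items()
    }
    best = max(scaled, key=scaled.get)
    return best, scaled[best]


# Hypothetical raw confidences for some input.
raw = {"latin1": 1.0, "some-model-prober": 0.78}

print(pick_winner(raw, 0.73))  # the model-based prober wins
print(pick_winner(raw, 0.80))  # latin1 wins
```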

The main idea of the universal charset detector (for single-byte charsets) is to find the language closest to the given text (ignoring digits and symbols such as apostrophes or quotation marks).
Please look first here and then here.
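
The "closest language" idea can be sketched with a toy bigram score. This is only an illustration under strong assumptions: the tiny bigram sets below are hand-picked rather than trained, and letter_bigrams and closeness are hypothetical helpers, not chardet functions. Real probers work on bytes with trained frequency tables:

```python
from collections import Counter


def letter_bigrams(text):
    """Count adjacent letter pairs, treating digits, punctuation and spaces as separators."""
    words = "".join(ch.lower() if ch.isalpha() else " " for ch in text).split()
    return Counter(w[i:i + 2] for w in words for i in range(len(w) - 1))


def closeness(text, model):
    """Fraction of the text's bigrams that appear in the model's set of common bigrams."""
    counts = letter_bigrams(text)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for bigram, n in counts.items() if bigram in model) / total


# Toy "models": tiny hand-picked bigram sets, for illustration only.
english = {"th", "he", "in", "er", "re", "ea", "se", "ar", "ch"}
hungarian = {"sz", "gy", "el", "eg", "et", "en", "ek", "cs"}

sample = "today's research in 2024"
print(closeness(sample, english), closeness(sample, hungarian))
```

Note that the apostrophe contributes nothing to either score, which is consistent with the remark that a lone symbol gives such a model very little to work with.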

BTW I have increased the value from 0.73 to 0.80 in my new latin1 prober.

commented on June 12, 2024

My own "milestone" (0 failures) was achieved when I found a solution for the failure with Greek Ά:

Ran 416 tests in 62.030s

OK

All changes are in my fork. You can clone it and try nosetests.

dan-blanchard commented on June 12, 2024

@helour That's fantastic news! I'm sorry I've been slow on the uptake of this, but I'll definitely check this out soon. Either this weekend or next week.

dan-blanchard commented on June 12, 2024

Fixed in 3.0.
