Comments (13)

commented on June 12, 2024
  1. Don't try to detect encoding from two words. You need at least 256 characters.
  2. The Hungarian (ISO-8859-2) language prober has a poor language model, which leads to wrong results.

from chardet.

shompol commented on June 12, 2024

The text in question is 3240 characters long, all characters are ASCII except for the ill-fated apostrophe. The sample above is sufficient to replicate the issue.

I ran it with chardet.constants._debug=1 and saw that windows-1252 is not even attempted. The documentation says that windows-1252 is only attempted as a last resort. That is the wrong priority in my case, since I am in North America. I guess the best way to proceed is to override a few methods...
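
As an alternative to overriding chardet's methods, a caller can apply the regional preference outside the detector. The following is a minimal sketch, assuming the input is either valid UTF-8 or windows-1252; the function name decode_preferring_cp1252 is hypothetical and not part of chardet.

```python
def decode_preferring_cp1252(data: bytes) -> str:
    """Decode bytes as UTF-8 if they are valid UTF-8, otherwise as windows-1252."""
    try:
        # Strict UTF-8 rejects a lone 0x92 byte (the cp1252 right single quote).
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is windows-1252.
        # errors="replace" covers the few byte values cp1252 leaves undefined.
        return data.decode("windows-1252", errors="replace")


# The sample from this thread: ASCII text plus one typographic apostrophe.
sample = ("today" + chr(8217) + "s research").encode("windows-1252")
decoded = decode_preferring_cp1252(sample)
```

This skips statistical detection entirely, so it only makes sense when the set of expected encodings is known in advance.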

dan-blanchard commented on June 12, 2024

@shompol I've been a bit behind on things lately since this is just a side project for me and I have a newborn son, but I plan to make a release "soon" (probably in the next month) that either disables the Hungarian prober (as is currently the case in master), or switches to the retrained version in #52. I would suggest giving the version in master a try and seeing if you still see the same issue.

dan-blanchard commented on June 12, 2024

Also, Firefox detects things correctly because they disabled their Hungarian prober a while ago. I didn't realize that until a month or two ago.

shompol commented on June 12, 2024

Thank you Dan, the master version works much better:

>>> import chardet
>>> s = 'today' + chr(8217) + 's research'
>>> b = s.encode('windows-1252')
>>> chardet.detect(b)
{'encoding': 'windows-1252', 'confidence': 0.73}

The original 3KB text detection works as well.

May I inquire why the Latin-1 confidence is artificially reduced by 0.73? It seems a little odd to me, especially since Latin-1 covers 99% of all texts I work with in the US.

Congratulations on your son! :)

sigmavirus24 commented on June 12, 2024

@dan-blanchard feel free to assign me a bug for a new release.

commented on June 12, 2024

Good news:
I'm working on a new mixed (multi-language) model for the latin1 (cp1252) prober. It replaces the old one, which is based on a very "strange heuristic/statistic" (strange to me, at least). The new model will be based on statistical probability, like the other "normal" single-byte probers.
Bad news:
Some languages in the "latin1 group" give me bad bigram correlations with other languages, and I need to decide which ones to exclude from this group. For those languages I want to create separate language models, as in the latin2 group.

shompol commented on June 12, 2024

Just a note that apparently chardet is also used by BeautifulSoup, and it got fixed too:

>>> s = 'today' + chr(8217) + 's research'
>>> b = s.encode('windows-1252')
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(b)
>>> soup
<html><head></head><body>today’s research</body></html>
>>> soup.original_encoding
'windows-1252'
Note: BeautifulSoup implicitly uses an HTML parser; in my case it is html5lib.

dan-blanchard commented on June 12, 2024

"Just a note that apparently chardet is also used by BeautifulSoup, and it got fixed too"

BeautifulSoup uses cChardet if it's available, and uses chardet as a fallback. Unfortunately, cChardet and chardet have very different detection characteristics these days. We can handle more encodings than cChardet, and for some of them we're more accurate, but cChardet is much faster (and already had the Hungarian detector disabled).

commented on June 12, 2024

@shompol: May I inquire why the Latin-1 confidence is artificially reduced by 0.73?
The latin1 prober is based on a "heuristic table" of letter types/categories (small ASCII, capital ASCII, small ASCII with acute, ...) and is not a normal language model. Values above 0.73 would let the latin1 prober win over the regular language models, which can lead to many wrong detections. There are many languages in the Latin1 group, so a single "heuristic table" is a better solution than creating many regular language models.
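
To make the shape of such a heuristic concrete, here is a simplified sketch of a category-based confidence that is scaled by 0.73 at the end. The category names, the set of "unlikely" transitions, and the penalty weight are illustrative assumptions, not chardet's actual table; only the 0.73 factor comes from this thread.

```python
SCALE = 0.73   # scaling factor discussed in this thread
PENALTY = 20   # assumed weight: one unlikely pair outweighs many likely ones

# Assumed rule: an accented capital directly after a lowercase letter is unlikely.
UNLIKELY = {
    ("ascii_lower", "accented_upper"),
    ("accented_lower", "accented_upper"),
}


def char_class(ch: str) -> str:
    """Map a character to a coarse letter category."""
    if not ch.isalpha():
        return "other"
    prefix = "ascii" if ord(ch) < 128 else "accented"
    return prefix + ("_upper" if ch.isupper() else "_lower")


def latin1_confidence(data: bytes) -> float:
    """Score how plausible `data` is as windows-1252 text, capped at SCALE."""
    text = data.decode("windows-1252", errors="replace")
    likely = unlikely = 0
    prev = "other"
    for ch in text:
        cur = char_class(ch)
        if (prev, cur) in UNLIKELY:
            unlikely += 1
        else:
            likely += 1
        prev = cur
    total = likely + unlikely
    if total == 0:
        return 0.0
    raw = max(0.0, (likely - PENALTY * unlikely) / total)
    return raw * SCALE


confidence = latin1_confidence(("today" + chr(8217) + "s research").encode("windows-1252"))
```

With no unlikely transitions the raw score is 1.0, so the reported confidence lands exactly on the cap. That matches the 0.73 seen earlier in the thread, although the real prober's table is more detailed than this sketch.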

The main idea of the universal charset detector (for single-byte charsets) is to search for the language closest to the given text (ignoring digits and symbols such as apostrophes or quotation marks).
Please look first here and then here.
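
The "closest language" idea can be illustrated with a toy bigram comparison. This is only a sketch under stated assumptions: the two training sentences, the top_n cutoff, and all function names are made up for illustration, and chardet's real single-byte probers rely on much larger precomputed models.

```python
from collections import Counter


def letter_bigrams(text: str) -> list:
    """Return letter bigrams per word, ignoring digits and symbols."""
    cleaned = "".join(ch.lower() if ch.isalpha() else " " for ch in text)
    return [w[i:i + 2] for w in cleaned.split() for i in range(len(w) - 1)]


def train(sample: str, top_n: int = 64) -> set:
    """Build a tiny model: the most frequent bigrams of a training sample."""
    counts = Counter(letter_bigrams(sample))
    return {bigram for bigram, _ in counts.most_common(top_n)}


def score(text: str, model: set) -> float:
    """Fraction of the text's bigrams that the model has seen."""
    bigrams = letter_bigrams(text)
    if not bigrams:
        return 0.0
    return sum(bigram in model for bigram in bigrams) / len(bigrams)


def closest_language(text: str, models: dict) -> str:
    """Pick the language whose model matches the text best."""
    return max(models, key=lambda name: score(text, models[name]))


# Hypothetical one-sentence training samples, far too small for real use.
models = {
    "english": train("the quick brown fox jumps over the lazy dog and then the other thing"),
    "hungarian": train("a gyors barna róka átugrik a lusta kutya felett és aztán tovább megy"),
}
best = closest_language("the other thing", models)
```

Because apostrophes and digits are dropped before counting, a single typographic apostrophe gives a scheme like this nothing to work with, which is consistent with the advice above to avoid detecting from very short samples.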

BTW, I have increased the value from 0.73 to 0.80 in my new latin1 prober.

commented on June 12, 2024

I reached my own "milestone" (0 failures) once I found a solution for the failure with the Greek Ά:

Ran 416 tests in 62.030s

OK

All changes are in my fork. You can clone it and try nosetests.

dan-blanchard commented on June 12, 2024

@helour That's fantastic news! I'm sorry I've been slow on the uptake of this, but I'll definitely check this out soon. Either this weekend or next week.

dan-blanchard commented on June 12, 2024

Fixed in 3.0.
