Retraining and storing data about chardet (open)

chardet avatar chardet commented on June 1, 2024
Retraining and storing data

from chardet.

Comments (8)

dan-blanchard avatar dan-blanchard commented on June 1, 2024

uchardet-enhanced actually has tools for retraining the C code, so we might not need to do this if we switch to just being a CFFI wrapper around that.

That said, uchardet-enhanced hasn't been touched in almost 3 years, so I'm a bit torn about relying on that. I'd also like to see the data files stored in a language-agnostic format, but that might come second to speed for most users.

dan-blanchard avatar dan-blanchard commented on June 1, 2024

Well, so much for that thought. I'm not convinced their updated tables are actually correct. If I swap cChardet in for chardet and run all of our unit tests, there are actually 53 test failures (vs our 1), so it looks like we're much slower but more accurate at this point.

dan-blanchard avatar dan-blanchard commented on June 1, 2024

I've created a new branch that makes chardet work like cChardet/uchardet-enhanced, but in pure Python. It's called feature/uchardet-enhanced-upstream. It performs worse than our current version, so it probably wasn't worth the effort. Oh well.

 avatar commented on June 1, 2024

I have created 7 new language models for Central European countries. The Romanian and Hungarian language models can't distinguish between cp1250 and latin2, because every letter of the Romanian and Hungarian alphabets sits at the same position in both tables :-(. All the new .py files, together with the modified sbcsgroupprober.py, are in my repository and my fork.
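
The "same place in both tables" claim can be checked with Python's built-in codecs. This is a minimal sketch, not code from the fork; the letter lists and the helper name `same_bytes` are assumptions made for illustration, and the Romanian letters use the cedilla forms because neither encoding contains the comma-below forms.

```python
# Accented letters used in Hungarian and Romanian (illustrative lists).
HUNGARIAN = "áéíóöőúüűÁÉÍÓÖŐÚÜŰ"
ROMANIAN = "ăâîşţĂÂÎŞŢ"


def same_bytes(text: str) -> bool:
    """Return True if text encodes to identical bytes in cp1250 and ISO-8859-2."""
    return text.encode("cp1250") == text.encode("iso8859_2")


print(same_bytes(HUNGARIAN))  # identical byte positions in both tables
print(same_bytes(ROMANIAN))   # identical byte positions in both tables
print(same_bytes("šťž"))      # Czech/Slovak letters sit at different positions
```

For letters such as š, ť and ž the two encodings disagree, which is why the ambiguity affects Hungarian and Romanian but not every Central European language.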

dan-blanchard avatar dan-blanchard commented on June 1, 2024

The language models are intentionally language-specific (and not encoding-specific), so it's actually the character-to-order maps that index into the language model tables. Anyway, I don't know much about what subset of CP1250 is used for Hungarian and Romanian, but according to the table at the top of the Wikipedia article, it looks like at a minimum the Euro symbol and the quotation marks are in different places. Therefore, we should be able to differentiate between the two based on at least those characters.
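
To make the split between a shared language model and per-encoding character-to-order maps concrete, here is a toy sketch. It is not chardet's actual data or layout: `RANKING`, `UNRANKED` and `char_to_order_map` are hypothetical names, and the ranking is a made-up handful of letters.

```python
# Hypothetical frequency ranking, most frequent first (not chardet's real data).
RANKING = "eatáéőű"
UNRANKED = 255  # assumed sentinel for bytes that are not ranked letters


def char_to_order_map(encoding: str, ranking: str) -> list:
    """Build a 256-entry table: byte value -> frequency rank of its character."""
    order = {ch: rank for rank, ch in enumerate(ranking)}
    table = []
    for byte in range(256):
        try:
            ch = bytes([byte]).decode(encoding)
        except UnicodeDecodeError:
            table.append(UNRANKED)  # byte is unassigned in this encoding
            continue
        table.append(order.get(ch, UNRANKED))
    return table


latin2_map = char_to_order_map("iso8859_2", RANKING)
cp1250_map = char_to_order_map("cp1250", RANKING)

# For a Hungarian-only ranking the two maps coincide, so a bigram model
# indexed through them sees exactly the same input for both encodings.
print(latin2_map == cp1250_map)

# The Euro sign exists only in cp1250, at byte 0x80.
print("€".encode("cp1250"))
```

Because only ranked letters get an order, symbols such as the Euro sign and the quotation marks never reach the model through these maps, so telling the encodings apart by those characters needs a check outside the letter ranking.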

If you have updated versions of the language models that work better in your fork, please create a pull request.

 avatar commented on June 1, 2024

From Wikipedia:
Windows-1250 is similar to ISO-8859-2 and has all the printable characters it has and more.
This means that ISO-8859-2 doesn't contain the "special" quotation marks (U+201E, U+201C, U+201D), the Euro symbol, etc. Without these "more characters" in the analyzed text, you can't distinguish between Latin2 and Cp1250. Simply put, the ISO-8859-2 repertoire is a subset of CP1250 and some chars are rearranged (big thanks M$). So every text in Romanian or Hungarian (this isn't true for other Central European languages) that contains only iso-8859-2 characters can be treated as Cp1250.
In my opinion, for such text it is simply better either to treat any iso-8859-2 result as cp1250, or to add a test (in sbcharsetprober.py) for whether the detected text contains any of the "more characters", because the language models are based on testing bigrams (two-character sequences) made up only of the 64 most frequent letters (= no symbols).
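
A rough sketch of such a test, written as a standalone helper rather than as a patch to sbcharsetprober.py. The names `CP1250_ONLY` and `refine_latin2_guess` are hypothetical, and the sketch assumes that a byte in the range 0x80-0x9F (a C1 control code in ISO-8859-2) does not occur in ordinary Latin2 text.

```python
def _cp1250_only_bytes() -> frozenset:
    """Collect bytes in 0x80-0x9F that cp1250 assigns to a character."""
    found = set()
    for byte in range(0x80, 0xA0):
        try:
            bytes([byte]).decode("cp1250")
        except UnicodeDecodeError:
            continue  # unassigned in cp1250, so it tells us nothing
        found.add(byte)
    return frozenset(found)


CP1250_ONLY = _cp1250_only_bytes()


def refine_latin2_guess(data: bytes) -> str:
    """Pick cp1250 if any of the "more characters" appear, else Latin2."""
    if any(byte in CP1250_ONLY for byte in data):
        return "windows-1250"
    return "iso-8859-2"


quoted = "„árvíztűrő”".encode("cp1250")   # contains U+201E and U+201D
plain = "árvíztűrő".encode("iso8859_2")   # letters only
print(refine_latin2_guess(quoted))
print(refine_latin2_guess(plain))
```

For text without any of these bytes the helper falls back to iso-8859-2; the other option from the comment, always reporting cp1250, would amount to changing that last return value.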

 avatar commented on June 1, 2024

I've created a small script, 'create_language_model.py', for building new Python language model files.
It needs the 'CharsetsTabs.txt' file to run correctly. Please read the comments in the header.
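
For readers unfamiliar with what such a builder has to compute, here is a generic sketch of counting bigrams over the most frequent letters of a training text. It is not the contents of 'create_language_model.py', which is not shown in this thread; `build_bigram_model` and its `top_n` parameter are hypothetical, and it produces raw counts rather than whatever format the real model files use.

```python
from collections import Counter


def build_bigram_model(text: str, top_n: int = 64):
    """Rank letters by frequency, then count bigrams among the top_n letters."""
    lowered = text.lower()
    letter_counts = Counter(ch for ch in lowered if ch.isalpha())
    ranking = [ch for ch, _ in letter_counts.most_common(top_n)]
    order = {ch: rank for rank, ch in enumerate(ranking)}

    bigrams = Counter()
    for first, second in zip(lowered, lowered[1:]):
        # Pairs involving symbols, digits or rare letters are skipped.
        if first in order and second in order:
            bigrams[(order[first], order[second])] += 1
    return ranking, bigrams


ranking, bigrams = build_bigram_model("abaab")
print(ranking)
print(bigrams)
```

Since the counts are keyed by letter rank and not by byte value, the same table can be reused for every encoding of the language through a per-encoding character-to-order map.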

dan-blanchard avatar dan-blanchard commented on June 1, 2024

#99 still uses Python language model files, but at least moves us in the right direction of allowing us to retrain at all.
