Currently we have a ton of encoding-specific data stored as constants all over the pla


Retraining and storing data about chardet HOT 8 OPEN

chardet commented on June 1, 2024

Retraining and storing data

from chardet.

Comments (8)

dan-blanchard commented on June 1, 2024

uchardet-enhanced actually has tools for retraining the C code, so we might not need to do this if we switch to just being a CFFI wrapper around that.

That said, uchardet-enhanced hasn't been touched in almost 3 years, so I'm a bit torn about relying on that. I'd also like to see the data files stored in a language-agnostic format, but that might come second to speed for most users.

dan-blanchard commented on June 1, 2024

Well, so much for that thought. I'm not convinced their updated tables are actually correct. If I swap cChardet in for chardet and run all of our unit tests, there are actually 53 test failures (vs our 1), so it looks like we're much slower but more accurate at this point.
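A comparison like the one described above can be sketched as a small harness that runs any detector over the same labelled samples. This is only an illustration, not the project's test suite: `count_failures` and the `cases` list are hypothetical names, and it assumes each detector returns a dict with an "encoding" key, as `chardet.detect` does.

```python
from typing import Callable, Iterable, Optional, Tuple

# Any callable that takes raw bytes and returns a dict with an "encoding" key.
Detector = Callable[[bytes], dict]


def count_failures(detect: Detector, cases: Iterable[Tuple[bytes, str]]) -> int:
    """Count the cases where the detected encoding differs from the expected one."""
    failures = 0
    for data, expected in cases:
        detected: Optional[str] = detect(data).get("encoding")
        # Compare case-insensitively; a detector that gives up returns None.
        if detected is None or detected.lower() != expected.lower():
            failures += 1
    return failures


# Hypothetical usage, assuming both libraries are installed:
#   import chardet, cchardet
#   print(count_failures(chardet.detect, cases))
#   print(count_failures(cchardet.detect, cases))
```

Passing the detector in as a callable keeps the sample set identical for both libraries, so the two failure counts are directly comparable.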

dan-blanchard commented on June 1, 2024

I've created a new branch that makes chardet work like cChardet/uchardet-enhanced, but in pure Python. It's called feature/uchardet-enhanced-upstream. It performs worse than our current version, so it probably wasn't worth the effort. Oh well.

commented on June 1, 2024

I have created 7 new language models for Central European countries. The Romanian and Hungarian language models can't distinguish between cp1250 and latin2 because every letter of the Romanian and Hungarian alphabets occupies the same position in both tables :-( . All new .py files, together with the modified sbcsgroupprober.py, are in my repository and my fork.
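The overlap can be checked with Python's built-in codecs. The sketch below is only an illustration: the letter lists and the `differing_letters` helper are hypothetical, and the Romanian list uses the cedilla forms of s and t because the comma-below forms exist in neither code page. Czech is included as a contrast.

```python
from typing import List

HUNGARIAN = "áéíóöőúüűÁÉÍÓÖŐÚÜŰ"
ROMANIAN = "ăâîşţĂÂÎŞŢ"  # cedilla forms of s and t
CZECH = "áčďéěíňóřšťúůýžÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ"


def differing_letters(letters: str) -> List[str]:
    """Return the letters whose byte value differs between cp1250 and ISO-8859-2."""
    return [ch for ch in letters if ch.encode("cp1250") != ch.encode("iso8859_2")]


print(differing_letters(HUNGARIAN))  # no letter differs
print(differing_letters(ROMANIAN))   # no letter differs
print(differing_letters(CZECH))      # the s, t and z with caron differ
```

For Hungarian and Romanian the result is empty, so a model that looks only at letters sees identical bytes under both encodings.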

dan-blanchard commented on June 1, 2024

The language models are intentionally language-specific (and not encoding-specific), so it's actually the character-to-order maps that index into the language model tables. Anyway, I don't know much about what subset of CP1250 is used for Hungarian and Romanian, but according to the table at the top of the Wikipedia article, it looks like at a minimum the Euro symbol and the quotation marks are in different places. Therefore, we should be able to differentiate between the two based on at least those characters.
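The indirection described above can be illustrated with a toy example. This is a minimal sketch, not chardet's actual code: the three-letter alphabet, the `LETTER_ORDER` ranks, the `LANGUAGE_MODEL` scores, and both helper functions are made up for illustration. The point is that one language model is shared, and only the byte-to-order map changes per encoding.

```python
from typing import Dict, Tuple

# Hypothetical frequency ranks for three letters (0 = most frequent).
LETTER_ORDER = {"a": 0, "š": 1, "ž": 2}

# One language model shared by every encoding: a pair of ranks -> score.
# The scores are invented for this example.
LANGUAGE_MODEL: Dict[Tuple[int, int], int] = {
    (0, 1): 3,
    (1, 0): 3,
    (0, 2): 2,
    (2, 0): 2,
}


def build_char_to_order(encoding: str) -> Dict[int, int]:
    """Build the per-encoding map from a byte value to a letter's rank."""
    return {ch.encode(encoding)[0]: order for ch, order in LETTER_ORDER.items()}


def score(data: bytes, char_to_order: Dict[int, int]) -> int:
    """Sum the model score of every adjacent byte pair that maps to known letters."""
    total = 0
    for first, second in zip(data, data[1:]):
        pair = (char_to_order.get(first), char_to_order.get(second))
        total += LANGUAGE_MODEL.get(pair, 0)  # unknown bytes contribute nothing
    return total


data = "aša".encode("cp1250")
print(score(data, build_char_to_order("cp1250")))     # 6
print(score(data, build_char_to_order("iso8859_2")))  # 0
```

The letter with a caron sits at a different byte in each encoding, so the same bytes score well under one map and not at all under the other. When every letter of a language shares its byte value across two encodings, both maps are identical for those letters and the scores tie.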

If you have updated versions of the language models that work better in your fork, please create a pull request.

commented on June 1, 2024

From Wikipedia:
Windows-1250 is similar to ISO-8859-2 and has all the printable characters it has and more.
This means that ISO-8859-2 doesn't contain the "special" quotation marks (U+201E, U+201C, U+201D), the Euro symbol, etc. Without these "more characters" in the analyzed text, you can't distinguish between Latin2 and CP1250. Simply put, the printable characters of ISO-8859-2 are a subset of CP1250, and some characters are rearranged (big thanks M$). So every text in Romanian or Hungarian (this isn't true for other Central European languages) that contains only ISO-8859-2 characters can be treated as CP1250.
In my opinion, for such text it is simply better to treat any ISO-8859-2 as CP1250, or to add a test (in sbcharsetprober.py) for whether the detected text contains any of the "more characters", because the language models are based on bigram (two-character sequence) tests that cover only the 64 most frequent letters (i.e. no symbols).
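The second option could look roughly like the sketch below. It is not code from sbcharsetprober.py; `cp1250_only_bytes` and `break_tie` are hypothetical names, and the sketch assumes the letter-based models have already tied between the two encodings. It relies on the fact that Windows-1250 places printable characters in 0x80-0x9F, a range that ISO-8859-2 reserves for control codes.

```python
def cp1250_only_bytes() -> frozenset:
    """Collect byte values in 0x80-0x9F that Windows-1250 maps to a character."""
    found = set()
    for value in range(0x80, 0xA0):
        try:
            bytes([value]).decode("cp1250")
        except UnicodeDecodeError:
            continue  # undefined in Windows-1250 (e.g. 0x81)
        found.add(value)
    return frozenset(found)


CP1250_ONLY = cp1250_only_bytes()


def break_tie(data: bytes) -> str:
    """Choose between the two encodings when the letter-based models tie."""
    if any(byte in CP1250_ONLY for byte in data):
        return "windows-1250"
    return "iso-8859-2"


print(break_tie("„Szép” 5 €".encode("cp1250")))  # windows-1250
print(break_tie("Szép".encode("iso8859_2")))      # iso-8859-2
```

Text without any of the extra characters decodes identically either way, so the fallback label in the last line is a matter of convention rather than detection.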

commented on June 1, 2024

I've created a small script, 'create_language_model.py', for building new Python language model files.
It needs the 'CharsetsTabs.txt' file to run correctly. Please read the comments in the header.

dan-blanchard commented on June 1, 2024

#99 still uses Python language model files, but at least moves us in the right direction of allowing us to retrain at all.
