
Merge with cChardet? · chardet · 21 comments · Open

chardet commented on June 12, 2024
Merge with cChardet?

from chardet.

Comments (21)

Ousret commented on June 12, 2024

Hi,

Merging with cchardet is still a good idea, and it remains relevant today.

There is still hope for chardet: I have worked on an alternative that is pure Python and faster than chardet, though certainly slower than uchardet.
Maybe it's worth a look: https://github.com/Ousret/charset_normalizer/

I don't use a dedicated probe per encoding. Around 90 encodings are detectable.
I'm seeking feedback.


dan-blanchard commented on June 12, 2024

Latest update on this: cChardet now actually supports many of the encodings I wanted to add in #99, so it makes even more sense to try to merge.


dan-blanchard commented on June 12, 2024

I'm re-opening this issue because I'm beginning to think it would make more sense to just try to extend cChardet (and the forked version of uchardet-enhanced it now uses) to support additional languages than to ignore the substantial speed differences here.


sigmavirus24 commented on June 12, 2024

Chardet is extraordinarily slow, so I second the recommendation.


dan-blanchard commented on June 12, 2024

Another idea is that we could try to merge cChardet with chardet so that there's one package, and if you're on CPython (or potentially PyPy, if cChardet works through cpyext), it uses cChardet, and otherwise it uses the pure Python version.
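The "one package, best available backend" idea could be sketched roughly like this. The package names are the real PyPI distributions, but the selection helper itself is a hypothetical illustration, assuming both expose a `detect(bytes)` function (which they do):

```python
import importlib

def load_backend(candidates=("cchardet", "chardet")):
    """Return the first importable detection backend, or None.

    Hypothetical selection helper: the C-backed cchardet is preferred
    when it can be imported (CPython, and possibly PyPy via cpyext),
    falling back to the pure-Python chardet otherwise.
    """
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None
```

Callers would then do `backend = load_backend()` and `backend.detect(raw_bytes)` without caring which implementation was found.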


Diaoul commented on June 12, 2024

+1
cChardet as an optional speed-up C-extension to chardet is a great idea.


Diaoul commented on June 12, 2024

Any progress on this? Also the ability to "help" chardet by supplying a language would be nice.


dan-blanchard commented on June 12, 2024

I haven't had a chance to look into this more, as chardet is very low on my priority list at the moment.

Really, we should talk to @PyYoshi about such a potential merger, but no one has reached out to them yet as far as I'm aware; although, I guess @-mentioning in this message sort of counts. 😄


zougloub commented on June 12, 2024

Hi,

I'm just providing some input. I like what chardet does, but I think it could be improved, and the first thing is to use a native backend for performance reasons (its source code is nice to read, but it is really slow). I'm sorry if I diverge a little from the original topic.

  • The new chardet project's health looks good
  • Python is not good at doing what chardet does
  • A split of libchardetect from the source tree of cChardet would be nice, to restore some "order", i.e., to have a portable C library that non-Pythonists can use
  • I found by accident an error in the SJIS hiragana detection which is also in the "Dive Into Python 3" code (I was catching up on news and opened a bunch of tabs in my browser, and saw http://www.diveintopython3.net/case-study-porting-chardet-to-python-3.html#unorderabletypes where '\202' is changed into 202 instead of 0x82). Fixed in zougloub@ab176dc, btw.
  • I also found differences in behaviour between chardet and cChardet
  • I propose exposing the probabilities for all the decoders in the results (even if pruning happens at some point for most of them) so that API clients can make further decisions (monkey-patching of C code is impossible)
  • There are various forks/wrappers of NS universalchardetector; it's clear that the original one cannot be used as is, since it's deep inside the Gecko source tree and has the superfluous NSPR dependency. It would be nice if there were a reference version that is unarguably the right one.
  • As of 2014, universalchardetector could use a rework. At the least, UTF-8 validation should be attempted earlier. Most documents are now UTF-8, and chardet can fail horribly at UTF-8 detection: a lot of short UTF-8 texts are detected as latin1 because they don't contain many multibyte UTF-8 sequences. I think that heuristic, at least, is broken. A lot of tables are hard-coded in the sources and do not represent the "preferred form of the work for making modifications to it"; they correspond to frequency analysis over corpora of text, which could be updated.
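The "try UTF-8 validation first" point in the last bullet can be sketched with nothing but strict decoding. This is an illustrative pre-check, not chardet's actual code path:

```python
def sniff_utf8_first(data: bytes):
    """Return 'utf-8' if data is valid UTF-8, else None.

    Illustrative pre-check only: multibyte UTF-8 sequences are
    self-validating, so a strict decode that succeeds on non-trivial
    input is strong evidence of UTF-8. On failure, the caller would
    fall through to the usual statistical probes.
    """
    try:
        data.decode("utf-8")  # raises UnicodeDecodeError on invalid bytes
        return "utf-8"
    except UnicodeDecodeError:
        return None
```

Note that pure ASCII also passes a strict UTF-8 decode, so a real detector would still want to distinguish ASCII from UTF-8 afterwards.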

I'll hang around in #chardet on Freenode; maybe I'll get something done over the weekend, maybe not.


sigmavirus24 commented on June 12, 2024

Hi @zougloub, your comment is very hard to follow, especially in the context of this issue. If I reply to something other than what you meant, I apologize in advance.

> The new chardet project health looks good

I disagree. There are almost no contributors and neither @dan-blanchard nor I have much time for active development or maintenance of it.

> Python is not good at doing what chardet does

We all agree on that.

> A split of libchardetect from the source tree of cChardet would be nice, to restore some "order", ie. have a portable C library that non-Pythonists can use

I'm not familiar with cChardet or libchardetect, and I'm not sure what you mean about restoring some "order". As I understand it, libchardetect can already be used by non-Pythonistas.

> I found by mistake an error in the SJIS hiragana detection which is also in the "Dive Into Python3" code (I was checking up news and opened a bunch of tabs in my browser, and saw http://www.diveintopython3.net/case-study-porting-chardet-to-python-3.html#unorderabletypes where '\202' is changed into 202 instead of 0x82). Fixed in zougloub/chardet@ab176dc btw.

If you found a bug (which is what I believe you're saying) please open a separate issue or (better yet) pull request regarding that. If you already have a fix (which it seems you do), please only open a pull request.
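For reference, the quoted '\202' bug is easy to demonstrate: in a Python string literal, \202 is an octal escape, so substituting decimal 202 for it during the port changed the byte value:

```python
# '\202' is an OCTAL escape: octal 202 == decimal 130 == hex 0x82
# (the SJIS hiragana lead byte in question).
assert ord('\202') == 0o202 == 0x82 == 130
# The buggy port substituted decimal 202, which is a different byte:
assert 202 != 0x82
```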

> I also found differences in behaviour of chardet and cChardet

They're separate projects. Were we to merge with cChardet, those differences would likely be resolved. Alternatively if we simply added bindings to libchardetect, those differences would not be problematic.

> I propose to expose the probabilities for all the decoders in the results (even if there is pruning happening at some point for most of them) so that API clients can make further decisions (monkey-patching of C code is impossible)

This would have to be exposed differently. Currently our API is stable and, in my opinion, sufficient. There should be a separate issue to discuss expanding the API to return the n most likely encodings.

> There are various forks/wrappers for NS universalchardetector, it's sure that the original one cannot be used as is, as it's deep inside the gecko source tree, and has the NSPR dependency, which is superfluous. It would be nice if there was a reference version that is unarguably the right one.

As I already said, neither of us who are working on this have the affordance of time to do this. If someone else does this, that would be great.

> As of 2014, universalchardetector could use a rework. At least, utf-8 validation should be attempted earlier. Most documents are now utf-8, and chardet can fail horribly at utf-8 detection. A lot of short utf-8 texts are detected as latin1 because they don't contain much multibyte utf-8 glyphs... I think at least that heuristic is broken. A lot of tables are hard-coded in the sources and do not represent the "preferred form of the work for making modifications to it", they correspond to frequency analysis over corpora of text, which could be updated...

Yes, this is already planned. @dan-blanchard has already graciously volunteered to spend some time on this. I think you'll also find that our UTF-8 detection is better now (although it hasn't been released to PyPI just yet).


zougloub commented on June 12, 2024

Thanks for the answer @sigmavirus24; I've been using chardet/charade/chardet for a while (every day, through "less") and it saves me time, but it sometimes gets things wrong, so when I went to look into the source code I got a little lost, which should not necessarily be the case (and this goes all the way up to the NS code). I'm taking a few hours to start a new library (https://github.com/zougloub/libchardet); at first I'll try to make it expandable and proxyable with some implementations of the NS code, but in the end it would be nice to have a limpid library that does character encoding detection and that is used by the Python chardet (or cChardet). Sorry for hijacking the thread.


dan-blanchard commented on June 12, 2024

@sigmavirus24 I've actually been thinking that we could switch to using the latest code from upstream directly via ctypes, since ctypes is supported by more than just CPython (unlike the Cython cChardet uses). Do you have any objections to such an approach?
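A ctypes binding of this sort needs no compilation step at all. A minimal sketch of the shape, using libc's strlen as a stand-in entry point since uchardet's shared library may not be installed on a given machine:

```python
import ctypes
import ctypes.util

# Runtime (no-compile) binding sketch with the stdlib ctypes module.
# libc's strlen stands in for a detector entry point here; a uchardet
# wrapper would follow the same load/declare/call pattern against
# uchardet's C API. POSIX-only as written.
libc = ctypes.CDLL(ctypes.util.find_library("c") or None)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

def c_strlen(data: bytes) -> int:
    """Call the C function directly from Python, with no extension module."""
    return libc.strlen(data)
```

Declaring `argtypes`/`restype` up front is what keeps such a wrapper portable across 32- and 64-bit platforms.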


sigmavirus24 commented on June 12, 2024

CFFI is the far better option; it has better support in PyPy and other implementations.


dan-blanchard commented on June 12, 2024

Wow, CFFI is really nice. Somehow I missed that before and I've just been using ctypes.

Hmm... maybe a CFFI refactoring for DRMAA Python is in order too.


dan-blanchard commented on June 12, 2024

As I've discussed in other issues and PRs (namely #42 and #48), I'm more concerned at this point with making chardet more accurate and Pythonic. After all the refactoring is done and we have the ability to train new models, then we can work on making it faster, but at that point we'll be doing things quite differently from cChardet.


dan-blanchard commented on June 12, 2024

😬 It is really astonishing how wrong I can be sometimes. Since cChardet wraps uchardet-enhanced, and not the original Mozilla code, it actually has all of the fixes we want as far as improved accuracy and whatnot are concerned. Maybe a CFFI wrapper around uchardet-enhanced is the way to go after all.


dan-blanchard commented on June 12, 2024

A slight wrinkle in this plan is that there is one encoding we support that cChardet/uchardet-enhanced does not: CP932, a variant of SHIFT_JIS.

We also have a longstanding PR to add MacRoman support.


dan-blanchard commented on June 12, 2024

I am now completely against this, because I want chardet to be able to detect encodings and languages that neither cChardet nor uchardet-enhanced do. That's coming in the 4.0 release (#99).


sigmavirus24 commented on June 12, 2024

This is definitely reasonable. If we do this correctly with CFFI we can get the speed without having to dive too much into C or restrict ourselves to CPython.


dan-blanchard commented on June 12, 2024

I actually checked after my most recent comment, and cChardet appears to work on PyPy as is. It does not work on Jython or IronPython from what I could tell, which I guess would be the only advantage of using CFFI, since the CFFI docs say it provides "a reasonable path for other Python implementations like IronPython and Jython."


jayvdb commented on June 12, 2024

From a distro packaging perspective, this needs to be done carefully or the result may be a lot of confusion.

Distros typically provide uchardet, now hosted by freedesktop, which is behind many of the forks that have sprung up. I created https://github.com/PyYoshi/uchardet/issues/5 and https://gitlab.freedesktop.org/uchardet/uchardet/issues/13 about this. It looks like @PyYoshi's fork is one of the most advanced, so it would be a significant amount of work to get those patches into the main repo.

Distros will want to build cChardet using the uchardet library that they already have from freedesktop, which has a different feature set.

Note also PyYoshi/cChardet#26, which is a fairly significant detection difference from chardet given the prevalence of the Windows curly smart quotes. Obviously the chardet test suite should pass when using the optimised merged cChardet version, which should hopefully catch regressions like that.

