Comments (21)
Hi,
Merging with cchardet is still a good idea, and still relevant nowadays.
There is still hope regarding chardet: I have worked on an alternative, fully native to Python and faster than chardet, though certainly slower than uchardet.
Maybe it's worth a look: https://github.com/Ousret/charset_normalizer/
It doesn't use a specific probe per encoding, and it detects around 90 encodings.
I'm seeking feedback.
from chardet.
Latest update on this: cChardet actually supports many of the encodings now that I wanted to add in #99, so it makes even more sense to try to merge.
from chardet.
I'm re-opening this issue because I'm beginning to think it would make more sense to just try to extend cChardet (and the forked version of uchardet-enhanced it now uses) to support additional languages than to ignore the substantial speed differences here.
from chardet.
Chardet is extraordinarily slow, so I second the recommendation.
from chardet.
Another idea is that we could try to merge cChardet with chardet so that there's one package, and if you're on CPython (or potentially PyPy, if cChardet works through cpyext), it uses cChardet, and otherwise it uses the pure Python version.
from chardet.
+1
cChardet as an optional speed-up C-extension to chardet is a great idea.
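A minimal sketch of how such a merged package might select its backend at import time (the module layout is an assumption, and the pure-Python branch below is a trivial stand-in for chardet's real detector, not its actual logic):

```python
# Sketch: prefer the compiled backend when importable, otherwise fall back.
try:
    import cchardet as _backend  # fast C extension, where available
except ImportError:
    _backend = None  # pure-Python path

def detect(data: bytes) -> dict:
    """Return a chardet-style {'encoding': ..., 'confidence': ...} result."""
    if _backend is not None:
        return _backend.detect(data)
    # Trivial placeholder standing in for the pure-Python detector.
    try:
        data.decode("ascii")
        return {"encoding": "ascii", "confidence": 1.0}
    except UnicodeDecodeError:
        return {"encoding": None, "confidence": 0.0}
```

Callers would keep using a single `detect()` entry point and never need to know which backend answered.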
from chardet.
Any progress on this? Also the ability to "help" chardet by supplying a language would be nice.
from chardet.
I haven't had a chance to look into this more, as chardet is very low on my priority list at the moment.
Really, we should talk to @PyYoshi about such a potential merger, but no one has reached out to them yet as far as I'm aware; although, I guess @-mentioning in this message sort of counts. 😄
from chardet.
Hi,
I'm just providing some input, I like what chardet does but I think it could be improved, and the first thing is to use a native backend for performance reasons (it's nice to read its source code, but really slow). I'm sorry if I diverge a little from the original topic.
- The new chardet project health looks good
- Python is not good at doing what chardet does
- A split of libchardetect out of the cChardet source tree would be nice, to restore some "order", i.e. have a portable C library that non-Pythonists can use
- I found by mistake an error in the SJIS hiragana detection which is also in the "Dive Into Python3" code (I was checking up news and opened a bunch of tabs in my browser, and saw http://www.diveintopython3.net/case-study-porting-chardet-to-python-3.html#unorderabletypes where '\202' is changed into 202 instead of 0x82). Fixed in zougloub@ab176dc btw.
- I also found differences in behaviour of chardet and cChardet
- I propose to expose the probabilities for all the decoders in the results (even if there is pruning happening at some point for most of them) so that API clients can make further decisions (monkey-patching of C code is impossible)
- There are various forks/wrappers of NS universalchardetector; clearly the original one cannot be used as is, since it's deep inside the Gecko source tree and has the NSPR dependency, which is superfluous. It would be nice if there were a reference version that is unarguably the right one.
- As of 2014, universalchardetector could use a rework. At least, UTF-8 validation should be attempted earlier: most documents are now UTF-8, and chardet can fail horribly at UTF-8 detection. A lot of short UTF-8 texts are detected as latin1 because they don't contain many multibyte UTF-8 sequences; I think at least that heuristic is broken. A lot of tables are hard-coded in the sources and do not represent the "preferred form of the work for making modifications to it"; they correspond to frequency analysis over corpora of text, which could be updated.
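The "validate UTF-8 first" idea from the last point is cheap to sketch. This is a hypothetical helper, not chardet's actual logic:

```python
def utf8_first(data: bytes):
    """Try strict UTF-8 validation before any statistical detection.

    Returns "utf-8" when the bytes are valid UTF-8 and actually use
    multibyte sequences, "ascii" for pure 7-bit input, and None when
    the statistical probers should take over.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None  # not UTF-8; fall through to the usual probers
    if any(b >= 0x80 for b in data):
        return "utf-8"  # valid multibyte UTF-8, no need to guess latin1
    return "ascii"
```

Because strict UTF-8 decoding rejects stray high bytes like latin1 `0xE9`, running this check first would avoid the misdetection of short UTF-8 texts described above.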
I'll hang around in #chardet on Freenode; maybe I'll get something done over the weekend, maybe not.
from chardet.
Hi @zougloub, your comment is very hard to follow, especially in the context of this issue. If I reply to something other than what you meant, I apologize in advance.
> The new chardet project health looks good
I disagree. There are almost no contributors and neither @dan-blanchard nor I have much time for active development or maintenance of it.
> Python is not good at doing what chardet does
We all agree on that.
> A split of libchardetect from the source tree of cChardet would be nice, to restore some "order", ie. have a portable C library that non-Pythonists can use
I'm not familiar with cChardet or libchardetect, and I'm not sure what you mean about restoring some "order". As I understand it, libchardetect can already be used by non-Pythonistas.
> I found by mistake an error in the SJIS hiragana detection which is also in the "Dive Into Python3" code (I was checking up news and opened a bunch of tabs in my browser, and saw http://www.diveintopython3.net/case-study-porting-chardet-to-python-3.html#unorderabletypes where '\202' is changed into 202 instead of 0x82). Fixed in zougloub/chardet@ab176dc btw.
If you found a bug (which is what I believe you're saying) please open a separate issue or (better yet) pull request regarding that. If you already have a fix (which it seems you do), please only open a pull request.
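For context, the bug in question is an octal-escape conversion: `'\202'` in Python 2 source is an octal escape, so the equivalent integer is `0x82` (decimal 130), not decimal 202. A quick check:

```python
# '\202' in Python 2 source is an octal escape, i.e. chr(0o202).
assert 0o202 == 0x82 == 130   # the correct integer conversion
assert 0o202 != 202           # reading it as decimal 202 silently shifts the value
# 0x82 is the SJIS lead byte for the hiragana range being tested.
```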
> I also found differences in behaviour of chardet and cChardet
They're separate projects. Were we to merge with cChardet, those differences would likely be resolved. Alternatively if we simply added bindings to libchardetect, those differences would not be problematic.
> I propose to expose the probabilities for all the decoders in the results (even if there is pruning happening at some point for most of them) so that API clients can make further decisions (monkey-patching of C code is impossible)
This would have to be exposed differently. Currently our API is stable and, in my opinion, sufficient. There should be a separate issue to discuss expanding the API to return the n most likely encodings.
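One hedged sketch of what such an extension could look like (the function name and input shape here are illustrative assumptions, not chardet's API):

```python
def detect_all(confidences, limit=None):
    """Rank candidate encodings by prober confidence, best first.

    `confidences` maps encoding name -> confidence in [0.0, 1.0];
    `limit` optionally truncates the result to the n most likely encodings.
    """
    ranked = sorted(confidences.items(), key=lambda kv: kv[1], reverse=True)
    results = [{"encoding": enc, "confidence": conf} for enc, conf in ranked]
    return results if limit is None else results[:limit]
```

Returning a ranked list keeps the existing single-result `detect()` untouched while letting API clients apply their own tie-breaking.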
> There are various forks/wrappers for NS universalchardetector, it's sure that the original one cannot be used as is, as it's deep inside the gecko source tree, and has the NSPR dependency, which is superfluous. It would be nice if there was a reference version that is unarguably the right one.
As I already said, neither of us working on this has the time to do it. If someone else does, that would be great.
> As of 2014, universalchardetector could use a rework. At least, utf-8 validation should be attempted earlier. Most documents are now utf-8, and chardet can fail horribly at utf-8 detection. A lot of short utf-8 texts are detected as latin1 because they don't contain much multibyte utf-8 glyphs... I think at least that heuristic is broken. A lot of tables are hard-coded in the sources and do not represent the "preferred form of the work for making modifications to it", they correspond to frequency analysis over corpora of text, which could be updated...
Yes, this is already planned. @dan-blanchard has already graciously volunteered to spend some time on this. I think you'll also find that our UTF-8 detection is better (although it hasn't been released to PyPI just yet).
from chardet.
Thanks for the answer @sigmavirus24; I've been using chardet/charade/chardet for a while (every day, through "less") and it saves me time, but it sometimes gets things wrong. When I went to look into the source code I got a little lost, which should not necessarily be the case (and this goes all the way up to the NS code). I'm taking a few hours to start a new library (https://github.com/zougloub/libchardet); I'll try at first to make it expandable and proxyable with some implementations of the NS code, but in the end it would be nice to have a limpid library that does character encoding detection and that is used by the Python chardet (or cChardet). Sorry for hijacking the thread.
from chardet.
@sigmavirus24 I've actually been thinking that we could switch to using the latest code from upstream directly via `ctypes`, since `ctypes` is supported by more than just CPython (unlike the Cython cChardet uses). Do you have any objections to such an approach?
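A rough sketch of the `ctypes` route, using the function names from the freedesktop uchardet C header (treat the exact signatures as an assumption); library loading is guarded so the wrapper degrades gracefully when the shared library isn't installed:

```python
import ctypes
import ctypes.util

# Locate the shared library; None if uchardet isn't installed.
_name = ctypes.util.find_library("uchardet")
_lib = ctypes.CDLL(_name) if _name else None

if _lib is not None:
    # Declare the C signatures so pointers survive the round trip.
    _lib.uchardet_new.restype = ctypes.c_void_p
    _lib.uchardet_handle_data.argtypes = [ctypes.c_void_p, ctypes.c_char_p,
                                          ctypes.c_size_t]
    _lib.uchardet_data_end.argtypes = [ctypes.c_void_p]
    _lib.uchardet_get_charset.restype = ctypes.c_char_p
    _lib.uchardet_get_charset.argtypes = [ctypes.c_void_p]
    _lib.uchardet_delete.argtypes = [ctypes.c_void_p]

def detect(data: bytes):
    """Return the detected charset name, or None if uchardet is unavailable."""
    if _lib is None:
        return None
    ud = _lib.uchardet_new()
    try:
        _lib.uchardet_handle_data(ud, data, len(data))
        _lib.uchardet_data_end(ud)
        return _lib.uchardet_get_charset(ud).decode("ascii")
    finally:
        _lib.uchardet_delete(ud)
```

Because `ctypes` is pure standard library, this would work anywhere CPython-style `ctypes` is available, including PyPy.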
from chardet.
CFFI is the far better option; it has better support in PyPy and other implementations.
from chardet.
Wow, CFFI is really nice. Somehow I missed that before and I've just been using ctypes.
Hmm... maybe a CFFI refactoring for DRMAA Python is in order too.
from chardet.
As I've discussed in other issues and PRs (namely #42 and #48), I'm more concerned at this point with making chardet more accurate and Pythonic. After all the refactoring is done, and we have the ability to train new models, then we can work on making it faster, but at that point we'll be doing things quite differently from cChardet.
from chardet.
😬 It is really astonishing how wrong I can be sometimes. Since cChardet wraps uchardet-enhanced, and not the original Mozilla code, it actually has all of the fixes we want as far as improved accuracy and whatnot are concerned. Maybe a CFFI wrapper around uchardet-enhanced is the way to go after all.
from chardet.
A slight wrinkle in this plan is that there is one encoding we support that cChardet/uchardet-enhanced does not: CP932, a variant of SHIFT_JIS.
We also have a longstanding PR to add MacRoman support.
from chardet.
I am now completely against this, because I want chardet to be able to detect encodings and languages that neither cChardet nor uchardet-enhanced do. That's coming in the 4.0 release (#99).
from chardet.
This is definitely reasonable. If we do this correctly with CFFI we can get the speed without having to dive too much into C or restrict ourselves to CPython.
from chardet.
I actually checked after my most recent comment, and cChardet appears to work on PyPy as is. It does not work on Jython or IronPython from what I could tell, which I guess would be the only advantage of using CFFI, since the CFFI docs say it provides "a reasonable path for other Python implementations like IronPython and Jython."
from chardet.
From a distro packaging perspective, this needs to be done carefully or the result may be a lot of confusion.
Distros typically provide uchardet, now hosted by freedesktop, which is behind many of the forks that have sprung up. I created https://github.com/PyYoshi/uchardet/issues/5 and https://gitlab.freedesktop.org/uchardet/uchardet/issues/13 about this. It looks like @PyYoshi's fork is one of the most advanced, so it would be a significant amount of work to get those patches into the main repo.
Distros will want to build cChardet using the uchardet library that they already have from freedesktop, which has a different feature set.
Note also PyYoshi/cChardet#26, which is a fairly significant difference in detection from chardet, given the prevalence of the Windows curly smart quotes. Obviously the chardet test suite should pass when using the optimised cChardet merged version, which should hopefully catch regressions like that.
from chardet.