Comments (5)
You're referring to filter_with_english_letters, right? I'm surprised that function is doing anything when used from here. Could you expand on this?
from chardet.
Oh sorry, I should have been more specific. I was referring to the original implementation by Mozilla, which can also be found in cChardet.
The implementation used by cChardet is not actually the current reference implementation anymore. I'm trying to target revision 207643 on the Mozilla Hg repository because that was the last version before they pulled out support for several older encodings. I've got a bit of a start on that in #42.
I've now got this added to #42. It's currently a fairly literal C translation, but could probably be improved in places.
As I'm going over more of the upstream code, I'm becoming more and more convinced that we likely do not want to emulate the Mozilla implementation anymore when it comes to Western encodings. Like @rsnair2 said, filter_with_english_letters does some really bizarre HTML tag removal, but it's only used for Latin1Prober. I'm not sure it makes sense to apply it to all documents, since there are plenty of non-XML/HTML documents that use < and > and would probably be thrown off by that filtering.
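To make the concern concrete, here is a minimal sketch of what tag-style filtering does to non-markup input. It is not the actual filter_with_english_letters code: strip_angle_bracket_spans is a hypothetical, simplified stand-in that only assumes the filter drops whatever sits between a < and the next >.

```python
import re


def strip_angle_bracket_spans(buf: bytes) -> bytes:
    """Hypothetical simplification of tag removal, not chardet's real filter.

    Replaces every span from '<' up to the next '>' with a single space.
    """
    return re.sub(rb"<[^>]*>", b" ", buf)


# In real HTML the high byte (0xE9, "e acute" in Latin-1) sits outside the tags
# and survives the filtering.
html = b"<p>caf\xe9 au lait</p>"
html_out = strip_angle_bracket_spans(html)

# In plain text that merely uses '<' and '>' as operators, everything between
# them is treated as a "tag" and thrown away, including the high byte a
# Latin-1 prober would want to look at.
plain = b"if x < caf\xe9 and y > 0: pass"
plain_out = strip_angle_bracket_spans(plain)

print(html_out)
print(plain_out)
```

With this kind of filter, the only non-ASCII evidence in the second input never reaches the prober, which is the failure mode I'd expect for non-XML/HTML documents.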
Another weird bit I've discovered in the current Mozilla code is that the Hungarian detectors are just commented out and unused. We don't have them commented out, and we frequently detect a Hungarian encoding instead of Windows-1252.
Fixed via #42.