Comments (11)
@atbest notice your result. The confidence value is 0.99, not 1.0. That means that this is not matching the way you expect it to, so this conditional is not evaluating to true. That means that chardet then starts using the probers to determine the character encoding. You'll likely have to determine how best to make a UTF8SIGProber in the likeness of UTF8Prober and a similar State Machine Model. Even then, given how similar the two encodings are (utf-8, utf-8-sig) you should understand that chardet works with probabilities and returns the encoding that statistically seems most likely so getting a consistently correct answer could be extraordinarily difficult.
I do appreciate that you're trying to fix this though.
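To illustrate how similar the two encodings are, here is a minimal sketch using only the standard library. The sample text and variable names are made up for illustration and are not from chardet. The byte streams differ only in the three-byte BOM prefix, which the utf-8-sig codec strips and the plain utf-8 codec keeps as U+FEFF.

```python
import codecs

text = "héllo"  # illustrative sample, not taken from the issue
plain = text.encode("utf-8")
with_bom = codecs.BOM_UTF8 + plain  # EF BB BF followed by the UTF-8 bytes

# Without a BOM, both codecs decode the input identically.
same_without_bom = plain.decode("utf-8") == plain.decode("utf-8-sig")

# With a BOM, utf-8 keeps it as U+FEFF while utf-8-sig strips it.
decoded_utf8 = with_bom.decode("utf-8")
decoded_sig = with_bom.decode("utf-8-sig")

print(same_without_bom)
print(repr(decoded_utf8))
print(repr(decoded_sig))
```

Since everything after the first three bytes is identical, a statistical prober has very little to distinguish the two; the BOM check is the only reliable signal.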
from chardet.
Thanks for the information. And I found it's very easy to fix. Just change the codecs.BOM to codecs.BOM_UTF8.
if aBuf[:3] == codecs.BOM_UTF8:
    # EF BB BF UTF-8 with BOM
    self.result = {'encoding': "UTF-8-SIG", 'confidence': 1.0}
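For anyone who wants to try the check outside of chardet, the same comparison can be sketched as a standalone function. The name detect_utf8_bom is hypothetical and is not part of chardet's API; it only mirrors the shape of the result dict shown above.

```python
import codecs
from typing import Optional


def detect_utf8_bom(buf: bytes) -> Optional[dict]:
    """Hypothetical helper: return a chardet-style result if buf starts
    with the UTF-8 BOM, otherwise None."""
    if buf[:3] == codecs.BOM_UTF8:
        # EF BB BF: UTF-8 with BOM
        return {"encoding": "UTF-8-SIG", "confidence": 1.0}
    return None


print(detect_utf8_bom(codecs.BOM_UTF8 + b"hello"))
print(detect_utf8_bom(b"hello"))
```

Returning None for the no-BOM case is an assumption of this sketch; in chardet the detector falls through to the probers instead.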
Could you please make a PR for this?
@atbest please be sure to add this instead of modifying the existing lines.
@dan-blanchard Sorry, what does PR stand for? I am new here...
@sigmavirus24 According to this page, codecs.BOM is FF FE, while aBuf[:3] is three bytes, and aBuf[:3] == codecs.BOM is always False, so it is not necessary to keep existing lines.
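The length mismatch is easy to confirm with the standard library alone. This is a small sketch with made-up variable names; note that the exact bytes of codecs.BOM depend on the platform's byte order (FF FE on little-endian machines, FE FF on big-endian ones), but it is two bytes long either way.

```python
import codecs

bom_len = len(codecs.BOM)                        # UTF-16 BOM is two bytes
is_utf16_alias = codecs.BOM == codecs.BOM_UTF16  # same constant under two names

sample = codecs.BOM_UTF8 + b"abc"                # illustrative input with a UTF-8 BOM

# A three-byte slice can never equal a two-byte constant.
old_check = sample[:3] == codecs.BOM
# Comparing against the three-byte UTF-8 BOM works as intended.
new_check = sample[:3] == codecs.BOM_UTF8

print(bom_len, is_utf16_alias, old_check, new_check)
```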
@atbest you're right. Looking at the official documentation codecs.BOM is an alias for codecs.BOM_UTF16.
A PR is a Pull Request. You fork this repository, push your changes to your copy (preferably on a separate branch) and then click the button to create a pull request to this repository.
@atbest you can also just click the "edit" button on the GitHub page for the specific file you want to edit. That'll create the fork for you and make a PR when you hit save. It can be handy when you only have to make a one file change.
@dan-blanchard it can be handy but I've also seen it destroy formatting because the ACE editor is terrible.
All right, done. Thanks.
Thanks @atbest
Fixed via #32.
Thanks @atbest!