Comments (6)
Thanks for the quick response.
The language is detected correctly, no tuning of frequencies is required.
I think it is worth limiting CP1125 to Ukrainian. Sorry, but I don't have the knowledge required to do this. Please look into it when you have time.
from charset_normalizer.
I have to say no. After much consideration, and several drafts that did not satisfy me, in the end I am going to close this.
I am still willing to accept a PR; someone else with a fresh perspective could do something about it.
What I am inclined to do is update the documentation to explain that case a bit more.
from charset_normalizer.
Indeed, those two code pages are very similar. The first course of action would be to verify the FREQUENCIES dict in assets/__init__.py: the Russian and Ukrainian character-frequency orders could need some fine-tuning.
If that does not work, we would need (in addition to the above) to limit CP1125's target language to Ukrainian.
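Since both cp866 and cp1125 ship as Python standard-library codecs, the overlap between them can be measured directly. This stdlib-only sketch lists the byte values where the two decoders disagree:

```python
# Scan all 256 byte values and record where cp866 and cp1125 decode
# to different characters (both codecs ship with CPython).
diffs = []
for b in range(256):
    raw = bytes([b])
    cp866_char = raw.decode("cp866")
    cp1125_char = raw.decode("cp1125")
    if cp866_char != cp1125_char:
        diffs.append((b, cp866_char, cp1125_char))

print(f"{len(diffs)} byte values decode differently")
for b, a, u in diffs:
    print(f"0x{b:02X}: cp866={a!r} cp1125={u!r}")
```

Any file whose bytes avoid the few differing positions decodes identically under both code pages, which is exactly why limiting CP1125's language may not change the outcome on a given file.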
Would you take a look? I have limited time to offer this week.
Regards,
from charset_normalizer.
I have started working on your case.
With all the files you gave me, the result does not differ at all whether the decoder is ibm866 or cp1125.
Even if I limited cp1125 to Ukrainian only, it would not matter at all.
with open("./char-dataset/ibm866/CHEBUR.TXT", "rb") as fp:
    payload = fp.read()

print(payload.decode("cp1125") == payload.decode("cp866"))  # Prints True!
Given the following code:
from charset_normalizer import from_path

if __name__ == "__main__":
    results = from_path(
        r"./char-dataset/ibm866/MGICMASK.TXT",
        explain=True,
        cp_isolation=["cp1125", "ibm866"]
    )

    print(
        str(results.best().could_be_from_charset)  # Will print ["cp1125", "ibm866"]
    )
I have limited the cp1125 to target Ukrainian only.
Logs:
2021-07-31 23:54:38,609 | WARNING | cp_isolation is set. use this flag for debugging purpose. limited list of encoding allowed : cp1125, ibm866.
2021-07-31 23:54:38,623 | INFO | cp1125 passed initial chaos probing. Mean measured chaos is 0.000000 %
2021-07-31 23:54:38,623 | INFO | cp1125 should target any language(s) of ['Ukrainian']
2021-07-31 23:54:38,626 | INFO | We detected language [('Ukrainian', 0.8243)] using cp1125
2021-07-31 23:54:38,626 | INFO | Using cp1125 code page we detected the following languages: [('Ukrainian', 0.8243)]
2021-07-31 23:54:38,627 | INFO | cp866 passed initial chaos probing. Mean measured chaos is 0.000000 %
2021-07-31 23:54:38,629 | INFO | cp866 should target any language(s) of ['Russian', 'Ukrainian', 'Serbian', 'Bulgarian']
2021-07-31 23:54:38,634 | INFO | We detected language [('Russian', 0.8948), ('Bulgarian', 0.8596), ('Ukrainian', 0.8243), ('Serbian', 0.6974)] using cp866
2021-07-31 23:54:38,635 | INFO | Using cp866 code page we detected the following languages: [('Ukrainian', 0.8243)]
cp866 will be categorized as a sub-match of cp1125, as the result is the same.
If the two decoders produced different results, limiting cp1125 to Ukrainian would matter; but in this case, it does not.
Nevertheless, there is room for improvement. I have a few ideas:
- Patch the sub match factoring and prefer the most coherent one (language detection) on top.
- Patch the function cd.encoding_languages to limit the returned languages for those specific charsets.
...Like this:
@lru_cache()
def encoding_languages(iana_name: str) -> List[str]:
    """
    Single-byte encoding language association. Some code pages are heavily linked to particular language(s).
    This function does the correspondence.
    """
    # Put too-specific code page target languages here.
    if iana_name == "cp1125":
        return ["Ukrainian"]

    unicode_ranges = encoding_unicode_range(iana_name)  # type: List[str]
    primary_range = None  # type: Optional[str]
    ...
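The first idea could work roughly like this. Candidate and prefer_most_coherent are hypothetical names used only for illustration, not charset_normalizer's actual classes; they merely mimic the chaos/coherence split visible in the logs above:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical, simplified stand-in for a charset match: only the
# fields needed to illustrate re-ranking equal sub-matches by
# language coherence.
@dataclass
class Candidate:
    encoding: str
    chaos: float       # lower is better (mess detection)
    coherence: float   # higher is better (language detection)
    submatches: List["Candidate"] = field(default_factory=list)

def prefer_most_coherent(best: Candidate) -> Candidate:
    """Among a best match and its byte-identical sub-matches,
    promote the one whose language coherence is highest."""
    pool = [best] + best.submatches
    return max(pool, key=lambda c: c.coherence)

# Scores taken from the log excerpt above.
cp866 = Candidate("cp866", 0.0, 0.8948)
cp1125 = Candidate("cp1125", 0.0, 0.8243, submatches=[cp866])

print(prefer_most_coherent(cp1125).encoding)  # cp866 (0.8948 > 0.8243)
```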
Before I can go any further, I need to identify all the code pages that are similar to the CP1125 case.
If it is only this one code page, I don't think we can produce a PR that addresses a single edge case.
from charset_normalizer.
An alternative solution: change the probing order so that CP866 is checked before CP1125. Then CP1125 will be marked as an alternative to CP866.
Ideally, all encodings should be probed in order of popularity. This will speed up detection of the most popular encodings. But I did not find detailed data on the popularity of encodings, other than this: https://w3techs.com/technologies/overview/character_encoding. But there is no CP866 or CP1125 at all.
As for the fact that the encodings are the same, this is not the case. If we find a file containing characters that differ between these encodings, then selecting the wrong encoding will damage the text. I believe that CP866 is more common due to the size of the countries' populations.
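A minimal illustration of that damage, using only stdlib codecs: "Ґанок" is an arbitrary Ukrainian word containing Ґ, a letter that exists in cp1125 but not in cp866:

```python
# Encode a Ukrainian word with cp1125, then decode it with the wrong
# code page: byte 0xF2 is Ґ in cp1125 but Є in cp866, so the first
# letter comes out damaged while the common Cyrillic letters survive.
word = "Ґанок"
raw = word.encode("cp1125")
print(raw.decode("cp866"))  # mojibake: "Єанок"
```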
Perhaps you are right and there is no point in changing anything. Chardet reports that these files are CP866, and this is the correct answer. On the other hand, CP1125 is valid for both Russian and Ukrainian texts, and CP866 is valid for Russian text with symbols of European languages, which is a much rarer case.
from charset_normalizer.
An alternative solution: change the probing order so that CP866 is checked before CP1125.
This will not happen. I do not want to disturb the actual order. (It is provided by from encodings.aliases import aliases, ordered alphabetically.)
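That ordering can be reproduced with the stdlib alone. The sketch below unions in the two names defensively, in case the alias table differs across Python versions. Notably, "cp1125" sorts before "cp866" ("1" < "8" in string comparison), which is why cp1125 is probed first and cp866 ends up as its sub-match:

```python
from encodings.aliases import aliases

# Distinct codec names known to the stdlib alias table, sorted
# alphabetically; "cp866" and "cp1125" added defensively.
order = sorted(set(aliases.values()) | {"cp866", "cp1125"})
print(order.index("cp1125") < order.index("cp866"))  # True: cp1125 first
```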
I have seen many people misinterpret https://w3techs.com/technologies/overview/character_encoding in the charset-detection context.
Neither Chardet nor Charset-Normalizer is an HTML charset detector. Detection concerns every kind of text document (SubRip subtitles, for example). And it is very hard to find any statistics at all on this matter. Users' usages can be very dispersed, so making assumptions is unwise.
As for the fact that the encodings are the same, this is not the case.
Chardet reports that these files are CP866, and this is the correct answer.
Maybe I expressed myself poorly. What I said is that with the files you gave me, there is no wrong answer. And Chardet is nowhere near supporting the same number of code pages.
I believe that CP866 is more common due to the size of the countries' populations.
Yes, I believe so. But I cannot maintain an order based on popularity. I need a perpetual, objective way of doing things.
Perhaps you are right and there is no point in changing anything.
In a matter of urgency, no. But I concur that something can be improved. To me, the fact that the property could_be_from_charset outputs both CP1125 and IBM866 is sufficient.
As stated in the introduction at https://charset-normalizer.readthedocs.io/en/latest/ I strongly advise you to expect the text to be decoded correctly rather than to search for the originating encoding; that search can be pointless.
from charset_normalizer.