aadsm / jschardet

Character encoding auto-detection in JavaScript (port of python's chardet)

License: GNU Lesser General Public License v2.1

JavaScript 98.58% HTML 0.12% CSS 0.58% Shell 0.72%

character-encoding charset

jschardet's Introduction

JsChardet

Port of python's chardet (https://github.com/chardet/chardet).

License

LGPL

How To Use It

Node

npm install jschardet

var jschardet = require("jschardet")

// "àíàçã" in UTF-8
jschardet.detect("\xc3\xa0\xc3\xad\xc3\xa0\xc3\xa7\xc3\xa3")
// { encoding: "UTF-8", confidence: 0.9690625 }

// "次常用國字標準字體表" in Big5
jschardet.detect("\xa6\xb8\xb1\x60\xa5\xce\xb0\xea\xa6\x72\xbc\xd0\xb7\xc7\xa6\x72\xc5\xe9\xaa\xed")
// { encoding: "Big5", confidence: 0.99 }

// "<string>Martin Kühl</string>" in windows-1252
jschardet.detectAll("\x3c\x73\x74\x72\x69\x6e\x67\x3e\x4d\x61\x72\x74\x69\x6e\x20\x4b\xfc\x68\x6c\x3c\x2f\x73\x74\x72\x69\x6e\x67\x3e")
// [
//   {encoding: "windows-1252", confidence: 0.95},
//   {encoding: "ISO-8859-2", confidence: 0.8796300205763055},
//   {encoding: "SHIFT_JIS", confidence: 0.01}
// ]

Browser

Copy and include jschardet.min.js in your web page.

This library is also available in cdnjs at https://cdnjs.cloudflare.com/ajax/libs/jschardet/1.4.1/jschardet.min.js

Options

// See all information related to the confidence levels of each encoding.
// This is useful to see why you're not getting the expected encoding.
jschardet.enableDebug();

// The default minimum accepted confidence level is 0.20, but sometimes this is
// too strict, especially when dealing with files that contain mostly numbers.
// Set it to 0 to always get a result, or to any other value that works
// for you.
jschardet.detect(str, { minimumThreshold: 0 });

// Restrict which encodings are considered. This can be useful when jschardet
// gives a higher probability to encodings that you never use.
jschardet.detect(str, { detectEncodings: ["UTF-8", "windows-1252"] });
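
To make the effect of these two options more concrete, here is a minimal sketch that applies the same two ideas (a confidence floor and an allow-list) to a candidate list shaped like the detectAll output shown above. The helper name pickEncoding is hypothetical, and this is not jschardet's internal implementation; it only illustrates the filtering logic.

```javascript
// Hypothetical helper, not part of jschardet: choose the best candidate from
// a list shaped like the detectAll() output.
function pickEncoding(candidates, options = {}) {
  const minimumThreshold = options.minimumThreshold ?? 0.2;
  const allowed = options.detectEncodings
    ? new Set(options.detectEncodings.map((name) => name.toLowerCase()))
    : null;

  const best = candidates
    .filter((c) => c.confidence >= minimumThreshold)
    .filter((c) => !allowed || allowed.has(c.encoding.toLowerCase()))
    .sort((a, b) => b.confidence - a.confidence)[0];

  // Fall back to an "unknown" result when nothing passes the filters.
  return best || { encoding: null, confidence: 0 };
}

// Candidate list copied from the detectAll example.
const candidates = [
  { encoding: "windows-1252", confidence: 0.95 },
  { encoding: "ISO-8859-2", confidence: 0.8796300205763055 },
  { encoding: "SHIFT_JIS", confidence: 0.01 },
];

console.log(pickEncoding(candidates));
console.log(pickEncoding(candidates, { detectEncodings: ["ISO-8859-2", "SHIFT_JIS"] }));
```

With an allow-list that excludes windows-1252, the next best candidate above the threshold is returned instead.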

Supported Charsets

Big5, GB2312/GB18030, EUC-TW, HZ-GB-2312, and ISO-2022-CN (Traditional and Simplified Chinese)
EUC-JP, SHIFT_JIS, and ISO-2022-JP (Japanese)
EUC-KR and ISO-2022-KR (Korean)
KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, and windows-1251 (Russian)
ISO-8859-2 and windows-1250 (Hungarian)
ISO-8859-5 and windows-1251 (Bulgarian)
windows-1252
ISO-8859-7 and windows-1253 (Greek)
ISO-8859-8 and windows-1255 (Visual and Logical Hebrew)
TIS-620 (Thai)
UTF-32 BE, LE, 3412-ordered, or 2143-ordered (with a BOM)
UTF-16 BE or LE (with a BOM)
UTF-8 (with or without a BOM)
ASCII

Technical Information

I haven't been able to create tests to correctly detect:

ISO-2022-CN
windows-1250 in Hungarian
windows-1251 in Bulgarian
windows-1253 in Greek
EUC-CN

Development

Use npm run dist to update the distribution files. They're available at https://github.com/aadsm/jschardet/tree/master/dist.

Authors

Ported from python to JavaScript by António Afonso (https://github.com/aadsm/jschardet)

Transformed into an npm package by Markus Ast (https://github.com/brainafk)

jschardet's Issues

Browserify everything into a single file to support CDNJS

This is needed for cdnjs/cdnjs#7056 that requires having a single file with the entire lib.

big5 incorrectly detected as windows-1252

Detection is correct for your Big5 test case "次常用國字標準字體表",
but if I use "你好" encoded in Big5 (str = "\xA7\x41\xA6\x6E"), it is detected as { encoding: 'windows-1252', confidence: 0.95 }.

Usage instructions in browser

I didn't see any instructions for using this library with a file read from the browser.

I tried using the File API to read a file as text and then detect the encoding. The guessed encoding always seems to be either ascii or windows-1251. I suspect that when reading the file as text it is automatically decoded when reading using the File API (I'm not familiar enough with unicode in the browser to be sure).

I noticed that the tests all use hard-coded strings instead of files. Is there a recommended method for using this library with files in the browser? Thanks! 😄
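
A minimal sketch of one way to avoid the decoding step suspected above: read the file as raw bytes and turn them into a "binary string" (one character per byte), which is the same shape as the hard-coded strings in the README examples. The helper name bytesToBinaryString is hypothetical, and passing its result to jschardet.detect in the browser is an assumption based on those examples.

```javascript
// Hypothetical helper: turn raw bytes into a "binary string" with one
// character (code 0x00-0xFF) per byte, so no text decoding takes place.
function bytesToBinaryString(bytes) {
  const CHUNK = 0x8000; // convert in slices to keep argument lists small
  let out = "";
  for (let i = 0; i < bytes.length; i += CHUNK) {
    out += String.fromCharCode.apply(null, bytes.subarray(i, i + CHUNK));
  }
  return out;
}

// Assumed browser usage (file is a File taken from an <input type="file">):
// file.arrayBuffer().then((buf) =>
//   console.log(jschardet.detect(bytesToBinaryString(new Uint8Array(buf)))));

// "à" encoded in UTF-8, followed by an ASCII "A".
console.log(bytesToBinaryString(new Uint8Array([0xc3, 0xa0, 0x41])) === "\xc3\xa0A");
```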

LGPL is a license hard to work with

We are using jschardet in a corporate environment where LGPL licensed software can't be used.

We are getting errors as external-editor is pulling jschardet as dependencies instead of dev-dependency. Can we have jschardet under MIT if possible?

Unicode character problem

Every message that contains the character ç next to another Unicode character comes out garbled.

Using encoding: UTF-8

çã shows as згo
çõ shows as уш

This can only be reproduced when the message is sent from IRC to Discord, where the IRC side may not be UTF-8.

reactiflux/discord-irc#399

"Cannot find module" after Electron build.

Your library is great, but there is an issue: when I start the app compiled with Electron, I get an error message. I had a similar problem with other libraries such as JSZip and js-xlsx, and I found a way to use them in Electron builds, but unfortunately I had no luck with jschardet.
Please can you have a look? Thanks!

Here attached a sample minimal project (to reproduce the problem):
sample-ko-with-electron-packager.zip

Uncaught Exception:
Error: Cannot find module './src'
at Module._resolveFilename (module.js:470:15)
at Function.Module._resolveFilename (/Users/dev/Desktop/App/releases/App-darwin-x64/App.app/Contents/Resources/electron.asar/common/reset-search-paths.js:35:12)
at Function.Module._load (module.js:418:25)
at Module.require (module.js:498:17)
at require (internal/module.js:20:19)
at Object.<anonymous> (/Users/dev/Desktop/App/releases/App-darwin-x64/App.app/Contents/Resources/app/node_modules/jschardet/index.js:1:173)
at Object.<anonymous> (/Users/dev/Desktop/App/releases/App-darwin-x64/App.app/Contents/Resources/app/node_modules/jschardet/index.js:2:3)
at Module._compile (module.js:571:32)
at Object.Module._extensions..js (module.js:580:10)
at Module.load (module.js:488:32)

`utf8prober` confidence function magic number "6" breaks short UTF-8 detection.

// src/utf8prober.js
this.getConfidence = function() {
        var unlike = 0.99;
        if( this._mNumOfMBChar < 6 ) {
            for( var i = 0; i < this._mNumOfMBChar; i++ ) {
                unlike *= ONE_CHAR_PROB;
            }
            return 1 - unlike;
        } else {
            return unlike;
        }
    }

Because of this magic number, UTF-8 text with fewer than 6 multi-byte characters gets a confidence that can never beat the other probers.

A simple fix would be to add a check on the ratio of multi-byte characters.
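
To see the numbers behind this report, the confidence function above can be run on its own. This sketch assumes ONE_CHAR_PROB is 0.5, the value used in Python chardet's utf8prober; that assumption is consistent with the 0.9690625 confidence the README reports for the five-character string "àíàçã". The function name utf8Confidence is hypothetical.

```javascript
// Assumption: ONE_CHAR_PROB is 0.5, as in Python chardet's utf8prober.
const ONE_CHAR_PROB = 0.5;

// Standalone copy of the getConfidence logic, taking the multi-byte
// character count as a parameter instead of reading this._mNumOfMBChar.
function utf8Confidence(numOfMBChar) {
  let unlike = 0.99;
  if (numOfMBChar < 6) {
    for (let i = 0; i < numOfMBChar; i++) {
      unlike *= ONE_CHAR_PROB;
    }
    return 1 - unlike;
  }
  return unlike;
}

// Print the confidence for 1 to 6 multi-byte characters.
for (let n = 1; n <= 6; n++) {
  console.log(n, utf8Confidence(n));
}
```

Under that assumption, text with four or fewer multi-byte characters stays below the 0.95 confidence that windows-1252 reports in several of the issues here, which would explain why short UTF-8 input loses to it.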

possible optimization?

Does it have to read the whole buffer?
https://github.com/aadsm/jschardet/blob/master/src/init.js#L72-L85

If it has to read the whole buffer, can it return a string as the result?

I want to check whether a file that I've opened is UTF-8 and then pass it on, but I don't want to convert the whole buffer to a string twice.

EUC-JP wrongly detected in this case that contains german umlaut

The following file detects as EUC-JP even though it is not. Seems to be caused by a single ü inside that file.

File: QuietLight.tmTheme.txt

Support converting from one encoding to another

I run a Node.js CMS called DocPad. Currently we only support UTF-8; however, Chinese users have noticed that DocPad does not work with GBK encoding.

It seems that jschardet can detect the encoding reasonably well. However, Node.js does not support these encodings when converting a buffer to a string; it only supports the encodings listed at http://nodejs.org/docs/latest/api/all.html#all_buffer

This makes me feel that the solution for us would be to detect the encoding with jschardet, then use something to convert that encoding to UTF8 or UTF16 - do you have any ideas on how this could be accomplished? And whether or not jschardet would be the correct project to handle such conversion?

Update: seems codes handles the conversion part, but not detection...

Wrong guess encoding as Windows 1252

See microsoft/vscode#33720

Test case

#!/bin/sh

foo() {
	echo "starting …"
}

Ellipsis symbol … makes vscode guess cp1252. UTF8 should have higher priority IMO

Use ECMAScript modules instead of commonjs/amd

Hello!

We use this package in our Angular 11 project. During the build we get a warning:

FILE_NAME_HERE depends on 'jschardet'. CommonJS or AMD dependencies can cause optimization bailouts.
For more info see: https://angular.io/guide/build#configuring-commonjs-dependencies

The recommendation is as follows:

It is recommended that you avoid depending on CommonJS modules in your Angular applications. Depending on CommonJS modules can prevent bundlers and minifiers from optimizing your application, which results in larger bundle sizes. Instead, it is recommended that you use ECMAScript modules in your entire application. For more information, see How CommonJS is making your bundles larger.

Would it be possible to do this? this would make our application size smaller and I believe this is also a more modern approach!

I am looking forward to your reply!

Detects ascii instead of utf-8

Passing this data as a buffer results in ascii with confidence 1.

It's the first portion of the stream from http://www.theverge.com/rss/frontpage

<!doctype html>

<!--[if lte IE 8]>   <html class="ie8 no-js">           <![endif]-->
<!--[if IE 9]>       <html class="ie9 no-js">           <![endif]-->
<!--[if gte IE 10]>  <html class="ie10 no-js">          <![endif]-->
<!--[if !IE]><!-->   <html lang="en-US" class="no-js">  <!--<![endif]-->

<head data-network="verge">

<script type="text/javascript">window.NREUM||(NREUM={});NREUM.info={"beacon":"beacon-5.newrelic.com","errorBeacon":"bam.nr-data.net","licenseKey":"e425f33c7f","applicationID":"754272","transactionName":"IVtWTBAMDVlXQh9HABBTXWcKFgNqQl9DRRZNR1BXFQ==","queueTime":8,"applicationTime":295,"ttGuid":"","agentToken":null,"agent":"js-agent.newrelic.com/nr-411.min.js"}</script>
<script type="text/javascript">window.NREUM||(NREUM={}),__nr_require=function(t,n,e){function r(e){if(!n[e]){var o=n[e]={exports:{}};t[e][0].call(o.exports,function(n){var o=t[e][1][n];return r(o?o:n)},o,o.exports)}return n[e].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<e.length;o++)r(e[o]);retur

Submit to JSDelivr

Can you please add this to JSDelivr so we can download it with our other scripts (via JSDelivr's concatenation feature).

See https://github.com/jsdelivr/jsdelivr/blob/master/CONTRIBUTING.md

return all encoding and confidences

Would it be possible to return all encodings and their confidences? We need this in VS Code:
microsoft/vscode#36951 (comment)

SJISDistributionAnalysis misses half of SJIS characters

In chardistribution.js line 246 through 248:

if( aStr.charCodeAt(1) > 0x7F ) {
    order = -1;
}

According to the SJIS encoding schemes, the DBCS programming guide from Microsoft, and the SJIS encoding table, the first-byte ranges of interest are 0x81 -- 0x9f and 0xe0 -- 0xef (the comment in line 235 is incorrect; the code in line 240 is correct), and the second-byte ranges are 0x40 -- 0x7e and 0x80 -- 0xfc (the comment in line 236 is incorrect; it should read 0x40 -- 0x7e, 0x80 -- 0xfc).

The code for the frequent Japanese character "の" is 0x82 0xcc. Since the second byte 0xcc (aStr.charCodeAt(1)) is bigger than 0x7F, line 246 sets order = -1, so the character is missed. Its order should be 328.

line 246 through 248 should look like this:

if( aStr.charCodeAt(1) < 0x40 || aStr.charCodeAt(1) === 0x7F || aStr.charCodeAt(1) > 0xFC) {
    order = -1;
}
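
The byte ranges described in this report can be written as two small predicates. This is only a sketch of the ranges as stated above, not the code in chardistribution.js, and the function names are hypothetical.

```javascript
// First byte of a double-byte SJIS character: 0x81-0x9f or 0xe0-0xef
// (ranges as given in the report above).
function isSjisLeadByte(b) {
  return (b >= 0x81 && b <= 0x9f) || (b >= 0xe0 && b <= 0xef);
}

// Second byte: 0x40-0x7e or 0x80-0xfc, i.e. 0x40-0xfc without 0x7f.
function isSjisTrailByte(b) {
  return b >= 0x40 && b <= 0xfc && b !== 0x7f;
}

// "の" is 0x82 0xcc: both bytes fall inside the valid ranges.
console.log(isSjisLeadByte(0x82) && isSjisTrailByte(0xcc));
```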

is version 1.4.0 released on npm?

Hi, I see in the commit history that jschardet is at version 1.4.0, but npm still points to 1.3.0.

Cannot use jschardet from browser using jschardet.min.js

Hello, thanks for jschardet!

The readme instructions for browser usage suggest including jschardet.min.js in the webpage. This does not work as the file contains Node.js filesystem functions that fail, for example readFileSync.

Note that the cdnjs option for browser usage does not use Node.js and runs successfully.

UTF-8 file guessed as ISO 8859-2

Guess the encoding on the attached file. It contains emojis but is a fine UTF-8 file.

strip.sh.zip

Error in UTF-8 with spanish characters

"Diseño de aplicación" is detected as 'GB2312'.

can't detect SHIFT_JIS

I tried to use this to detect a SHIFT_JIS file, but it returns {codeType: null, confidence: 0}…
I put this file here

Please add missing Git tag for v1.5.1

There is no tag in Git for release v1.5.1.

Support for stream

It would be great if encoding detection could accept stream as input, trying to detect the encoding by block, and returning a result as soon as a minimum confidence level (in an option) is reached.
It could bring some serious speed improvements for large buffers (for big webpage the encoding detection is regularly longer than 1 second).

Any idea how we could do that?
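
One way the idea could look from the caller's side is sketched below. It is not a streaming prober: it simply re-runs a supplied detection function on a growing prefix and stops early once the confidence is high enough, so it only pays off when an early chunk is already decisive. The name detectFromChunks and its parameters are hypothetical; detectFn stands in for something like (s) => jschardet.detect(s).

```javascript
// Hypothetical helper: feed chunks one by one and stop as soon as the
// injected detectFn reports a confidence of at least minConfidence.
function detectFromChunks(chunks, detectFn, minConfidence = 0.9) {
  let seen = "";
  let last = { encoding: null, confidence: 0 };
  for (const chunk of chunks) {
    seen += chunk;
    last = detectFn(seen); // re-detects the whole prefix each time
    if (last.encoding && last.confidence >= minConfidence) {
      break; // confident enough, skip the remaining chunks
    }
  }
  return last;
}

// Stand-in detector used only for this illustration.
let calls = 0;
const fakeDetect = (s) => {
  calls++;
  return s.length >= 4
    ? { encoding: "UTF-8", confidence: 0.99 }
    : { encoding: "ascii", confidence: 0.5 };
};

console.log(detectFromChunks(["ab", "cd", "ef"], fakeDetect), calls);
```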

Detect encoding by looking for specific markers

I was wondering if jschardet would ever consider to understand specific markers within a file to get the encoding from. For example, XML can have an encoding in the header:

<?xml version="1.0" encoding="windows-1251"?>

and HTML as well:

<meta charset="..."/>

There may be other languages where this exists too.

Refs: microsoft/vscode#36230
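
A rough sketch of what such a marker check could look like for the two examples given in this request. The function name sniffDeclaredEncoding is hypothetical and nothing like it is claimed to exist in jschardet. It only works when the head of the file is in an ASCII-compatible encoding, and a declared encoding can of course be wrong.

```javascript
// Hypothetical helper: look for an XML declaration or an HTML meta charset
// near the start of the content and return the declared name, else null.
function sniffDeclaredEncoding(head) {
  const xml = /^\s*<\?xml[^>]*\bencoding\s*=\s*["']([A-Za-z0-9._:-]+)["']/i.exec(head);
  if (xml) {
    return xml[1];
  }
  const meta = /<meta[^>]*\bcharset\s*=\s*["']?([A-Za-z0-9._:-]+)/i.exec(head);
  return meta ? meta[1] : null;
}

console.log(sniffDeclaredEncoding('<?xml version="1.0" encoding="windows-1251"?>'));
console.log(sniffDeclaredEncoding('<html><head><meta charset="utf-8"/></head>'));
```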

UTF8 isn't recognized correctly

Hi,

I'd like to use jschardet for exactly what the original chardet was built for: parsing feeds.

Unfortunately, detection fails on German letters encoded in UTF-8:

var L=new Buffer([0xc3,0xa4]).toString("utf8");
console.log(jschardet.detect(L));

gives { encoding: 'windows-1252', confidence: 0.95 }

But 0xc3a4 is "ä" in utf8.

Thanks Robert

Wrong guess encoding using ISO 8859-1

I have a file saved with "ISO 8859-1" encoding. When I try to open it, it is detected as "Windows 1251" and the accented characters are broken.

Example attached to perform this test.

example-ISO-8859-1.txt

unreliable detection - windows1250

windows-1250 is listed as Hungarian, but it is really Central European, so the text may also be Slovak or Czech, or possibly other languages. The proper name is "Central European". The accented characters to recognize include, for example, čČšŠťŤžŽéÉľĽ.

Found in VS Code, which uses this library: text saved as 1250 is detected on reopen as 1252, as another 125* code page, or even as ISO-8859-2, depending on which subset of these non-basic characters the content contains.

Performance regression in 2.*

In VS Code 1.41 jschardet was updated from 1.6.0 to 2.1.1, which introduced some performance problems.

Check this for more information (screens and cpuprofile):
microsoft/vscode#87205

cp1250 is not detected

It does not work with Romanian subtitle files. OpenSubtitles detects these files as "cp1250", jschardet detects the encoding as "windows-1252".

Wrong characters: ã þ º
Correct Romanian special characters: ă Ă â Â î Î ş Ş ţ Ţ

Test file: http://dl.opensubtitles.org/en/download/file/1954820326.srt

I've tested with many more files. If I use iconv-lite with "cp1250" (instead of the detected "windows-1252"), the file is converted to "utf8" correctly.

ISO 8859 not detected in this case

Detect attached file. The result will be windows-1252

iso-8859-1.txt

Can I get the list of encoding in confidence order?

I want to show users a list of candidate encodings and let them pick one.

feeding tiny buffers can cause incorrect detection

I know it's only a very minor thing, but I was looking at the implementation and it struck me.

If for example I feed the single character '\xEF', the code path that checks for BOMs will not find anything. If I then feed the 2 characters '\xBB\xBF' (completing the UTF-8 BOM), the BOM checking code path is skipped. If the detector is then closed, the UTF-8 BOM is detected as windows-1252 with 95% confidence...
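
One possible way around this is to hold back the BOM check until enough bytes have arrived or the detector is closed. The sketch below is not the code in universaldetector.js; the class name BomSniffer and the encoding labels it returns are illustrative assumptions. It expects binary strings (one character per byte), like the README examples.

```javascript
// Longer BOMs come first so that UTF-32 LE (FF FE 00 00) is not mistaken
// for UTF-16 LE (FF FE).
const BOMS = [
  { name: "UTF-32BE", bytes: "\x00\x00\xFE\xFF" },
  { name: "UTF-32LE", bytes: "\xFF\xFE\x00\x00" },
  { name: "UTF-8", bytes: "\xEF\xBB\xBF" },
  { name: "UTF-16BE", bytes: "\xFE\xFF" },
  { name: "UTF-16LE", bytes: "\xFF\xFE" },
];
const MAX_BOM_LENGTH = 4;

// Hypothetical helper: buffers the first bytes across feed() calls and only
// decides once MAX_BOM_LENGTH bytes were seen or close() is called.
class BomSniffer {
  constructor() {
    this.head = "";
    this.done = false;
    this.result = null;
  }

  feed(chunk) {
    if (this.done) return this.result;
    this.head += chunk.slice(0, MAX_BOM_LENGTH - this.head.length);
    if (this.head.length >= MAX_BOM_LENGTH) this.close();
    return this.result;
  }

  close() {
    if (!this.done) {
      const hit = BOMS.find((bom) => this.head.startsWith(bom.bytes));
      this.result = hit ? hit.name : null;
      this.done = true;
    }
    return this.result;
  }
}

// The split UTF-8 BOM from the report: "\xEF" first, then "\xBB\xBF".
const sniffer = new BomSniffer();
sniffer.feed("\xEF");
sniffer.feed("\xBB\xBF");
console.log(sniffer.close());
```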

vue-cli-service build fails with a compile error

5:32 Cannot find name 'Buffer'. Do you need to install type definitions for node? Try npm i @types/node and then add node to the types field in your tsconfig.
    3 | confidence: number
    4 | }
  > 5 | export function detect(buffer: Buffer | string, options?: { minimumThreshold: number }): IDetectedMap;
      |                                ^
    6 |
    7 | export function enableDebug(): void;
    8 |

Big5 wrongly guessed as windows-1252

> const jschardet = require("jschardet")
> jschardet.detect(Buffer.from([164,112,164,67]))
{ encoding: 'windows-1252', confidence: 0.95 }

UTF-8 or utf-8, we need to make a decision

Usually, when a file is detected as UTF-8, the lowercase name utf-8 is returned.
However, when a file is detected with a confidence of 1, the uppercase name UTF-8 is returned.

I don't think returning two spellings of utf-8 is a good choice. Could you unify them?

Usage outside Node.js + current completion

Hi,

Thank you for your work on this port, it's most useful. Couple of questions:

Is it usable without Node.js? Looking at the code history, there seems to have been a non-Node.js version before.
What's its current completion status? (From the README, 5 charsets couldn't be tested; are these the only "missing" bits?)

Thank you.

TypeError: Cannot assign to read only property '1' of string ''

The error

ERROR TypeError: Cannot assign to read only property '1' of string '��'
    at SJISProber.feed (sjisprober.js:73)
    at MBCSGroupProber.CharSetGroupProber.feed (charsetgroupprober.js:69)
    at UniversalDetector.feed (universaldetector.js:156)
    at runUniversalDetector (index.js:52)
    at Object.push.rVdK.exports.detect (index.js:34)
    at FileReader.fileReader.onload (app.component.ts:21)
    at ZoneDelegate.invoke (zone-evergreen.js:364)
    at Object.onInvoke (core.js:28494)
    at ZoneDelegate.invoke (zone-evergreen.js:363)
    at Zone.runGuarded (zone-evergreen.js:133)

is thrown in

jschardet/src/sjisprober.js

Line 73 in 13ddd7e

this._mLastChar[1] = aBuf[0];

in my Angular 11 project, when I select cp1252.txt in the input of my AppComponent

app.component.ts

import { Component } from '@angular/core';
import * as jschardet from 'jschardet';

@Component({
  selector: 'app-root',
  template: `
      <input type="file" id="file" (change)="decode($event)">
  `
})
export class AppComponent {
  decode(e: any) {
    const file = e.target.files[0];
    console.log(typeof file);
    const fileReader = new FileReader();
    fileReader.onload = function() {
      const array = new Uint8Array(fileReader.result as ArrayBuffer);
      let string = "";
      for (let i = 0; i < array.length; ++i) {
        string += String.fromCharCode(array[i]);
      }
      console.log(jschardet.detect(string));
    };
    fileReader.readAsArrayBuffer(file);
  }
}

I only have this issue in a larger Angular project. I tried to reproduce it in a new Angular project as minimal example, but the TypeError is not thrown in this project. I suspect that some (build) configuration is different between the two projects, but I can't think of what that might be. Do you have any idea?

Besides, what should

this._mLastChar[1] = aBuf[0];

actually do? this._mLastChar is a string. They are immutable in JavaScript, aren't they?

https://stackoverflow.com/q/68568242/1065654
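
On the last question: strings are indeed immutable, and what happens on such an assignment depends on strict mode. The sketch below shows both behaviours; it uses the Function constructor only so that each snippet runs in a known mode regardless of how the surrounding file is loaded. Whether strict mode is what differs between the two Angular projects is an assumption, not something confirmed in this thread.

```javascript
// Sloppy mode: the indexed write to a string is silently ignored.
const sloppy = new Function('var s = "ab"; s[1] = "x"; return s;');

// Strict mode: the same write throws a TypeError.
const strict = new Function('"use strict"; var s = "ab"; s[1] = "x"; return s;');

console.log(sloppy()); // the string comes back unchanged

let strictError = null;
try {
  strict();
} catch (err) {
  strictError = err;
}
console.log(strictError instanceof TypeError, strictError && strictError.message);
```

If a bundler or compiler setting makes the library code run as strict (for example by treating it as an ES module), the assignment in sjisprober.js would start throwing in the same way.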

gb2312 not ok

test web http://www.zgyb.cn/

var Crawler = require("crawler"),
    url = require('url'),
    levelup = require('levelup'),
    fs = require('fs'),
    db = levelup('./mydb');

var c = new Crawler({
    maxConnections : 10,
    callback : function (error, response, $) 
    {
        if(error)
        {

              return;
        }

        // X-Safe-Firewall
        var aHds = "Content-Security-Policy,X-Webkit-CSP,X-Content-Security-Policy,X-Frame-Options,X-XSS-Protection,X-Content-Type-Options,Server,X-Powered-By".split(",");
        // current URL and title
        console.log(response["uri"] + " " + $("title").text().trim());


    }
});

 c.queue({ userAgent:"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36",
                            uri : "http://www.zgyb.cn/",
                            encoding:"utf8",
                            timeout:2000
                          });

remove with statements

Currently jschardet cannot be used in JavaScript strict mode because of the with statements in the following files:
escsm.js
mbcssm.js

Also, line 372 in mbcssm.js should be "jschardet.UCS2BE_cls" instead of "UCS2BE_cls".

Distinguish macintosh from ISO-8859-2?

Hi,

When I try to detect the encoding of the content of a text file (macintosh), your tool returns ISO-8859-2.

When I then try to decode it, it doesn't work with this tool: https://github.com/mathiasbynens/iso-8859-2

But it works with this one: https://github.com/mathiasbynens/macintosh

Is there a way to distinguish them, please?

TextDecoder label incompatibility: 'ibm855' and 'maccyrillic' (should be x-mac-cyrillic?)

I've found two instances where the encoding string output by jschardet doesn't conform to what I think is effectively the de facto standard, the labels used by the web browser TextDecoder/TextEncoder:

TextDecoder Docs:
https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder/TextDecoder

This is the standard list of encodings and aliases:
https://developer.mozilla.org/en-US/docs/Web/API/Encoding_API/Encodings

And:
https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder/encoding

Official Spec:
https://encoding.spec.whatwg.org/#dom-textdecoder-encoding

jschardet outputs 'maccyrillic', but seems likely this is better called 'x-mac-cyrillic' which is officially documented here:
https://en.wikipedia.org/wiki/Mac_OS_Cyrillic_encoding

'x-mac-cyrillic' is the correct one per the TextDecoder standard.

Another tricky one is ibm855:
https://en.wikipedia.org/wiki/Code_page_855

This has no equivalent in TextDecoder, but this is noted as being in the ISO 8859 group, perhaps ISO-8859-2 (?).

Options:

1. Change maccyrillic -> x-mac-cyrillic, its standard name
2. Do nothing, have end users translate
3. Lobby for maccyrillic to be an alias for x-mac-cyrillic (requires changes to every browser)

Recommend (1) and further discussion on recommended standard ISO mapping for ibm855.
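
Until the name changes, the "have end users translate" option could look roughly like this. The helper name toTextDecoderLabel is hypothetical, and it only handles the two names discussed in this report; every other name is passed through in lowercase and may or may not be a valid TextDecoder label.

```javascript
// Names from this report only; extend as needed.
const TEXT_DECODER_LABELS = { maccyrillic: "x-mac-cyrillic" };
const NO_TEXT_DECODER_EQUIVALENT = new Set(["ibm855"]);

// Hypothetical helper: map a detected encoding name to a TextDecoder label,
// or return null when no equivalent is known.
function toTextDecoderLabel(detected) {
  if (!detected) return null;
  const key = detected.toLowerCase();
  if (NO_TEXT_DECODER_EQUIVALENT.has(key)) return null;
  return TEXT_DECODER_LABELS[key] || key;
}

console.log(toTextDecoderLabel("MacCyrillic"));
console.log(toTextDecoderLabel("IBM855"));
```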

SHIFT-JIS not detected in this case

Detect attached file. The result will be windows-1252

shift-jis.txt

UTF-8 encoding of Degree Symbol

The issue I'm having is because of the degree symbol:
UTF-8 \xc2\xb0
http://www.fileformat.info/info/unicode/char/b0/index.htm

Below, I include the boiled-down calls. My true testing data sample includes properly formatted XML; but through testing I found that having more and more text does not affect the confidence or output of the "jschardet.detect()" call.

With 1, 2, or 3 degree symbols, it detects as windows-1252 (which parses with an extra \xc2 for each, since it's supposed to be UTF-8)
jschardet.detect('\xc2\xb0');

With 4 degree symbols, it detects as EUC-KR
jschardet.detect('\xc2\xb0\xc2\xb0\xc2\xb0\xc2\xb0');
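
As a workaround sketch for cases like this one, a strict UTF-8 validity check can be run on the raw bytes before relying on the statistical result. It uses the standard TextDecoder with the fatal option and is not part of jschardet; the function name isValidUtf8 is hypothetical. Pure ASCII input also passes, and short legacy-encoded input can occasionally be valid UTF-8 by coincidence, so this is a hint rather than proof.

```javascript
// Returns true when the bytes decode as UTF-8 without any error.
function isValidUtf8(bytes) {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch (err) {
    return false; // fatal mode throws on malformed sequences
  }
}

// One degree symbol in UTF-8 (C2 B0), as in the report above.
console.log(isValidUtf8(Uint8Array.from([0xc2, 0xb0])));
// The same symbol as a single windows-1252 byte (B0) is not valid UTF-8.
console.log(isValidUtf8(Uint8Array.from([0xb0])));
```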

GB18030 encoded file incorrectly detected as gb2312

atom/encoding-selector#65

Steps to Reproduce

https://github.com/malice-plugins/yara/blob/17a4fc946febe8b002e285f591bcb21b92a99e9e/rules/userdb_panda.yar

Open in Atom
Select "Auto Detect" encoding,

Expected behavior: Detects the encoding of the file as GB18030.
iconv -f GB18030 -t UTF-8 userdb_panda.yar works

Actual behavior: Atom auto detects the encoding as gb2312, 'undefined encoding'

iconv fails to convert from GB2312, but works with GB18030:

iconv -f GB2312 -t UTF-8 userdb_panda.yar
iconv: illegal input sequence at position 29230

Reproduces how often: Always

can't detect gb2312

Detecting the encoding of this RSS site fails: http://news.baidu.com/n?cmd=1&class=civilnews&tn=rss

 { encoding: null, confidence: 0 }

RFE: Support detecting CP-1256 Arabic language code pages

Hi,

What does it take to create the statistical model to support win-1256 code pages? Thanks

GBK not detected in this case

file: Untitled-1.txt
output with debug:

EUC-TW prober hit error at byte 0

windows-1251 confidence = 0, below negative shortcut threshhold 0.05

UTF-8 not active

SHIFT_JIS confidence = 0.01

EUC-JP confidence = 0.6666666666666666

GB2312 confidence = 0

EUC-KR confidence = 0

Big5 confidence = 0

EUC-TW not active

UTF-8 not active

SHIFT_JIS confidence = 0.01

EUC-JP confidence = 0.6666666666666666

GB2312 confidence = 0

EUC-KR confidence = 0

Big5 confidence = 0

EUC-TW not active

EUC-JP confidence 0.6666666666666666
windows-1251 confidence = 0

KOI8-R confidence = 0

ISO-8859-5 confidence = 0.17355247990093067

MacCyrillic confidence = 0.01

IBM866 confidence = 0.01

IBM855 confidence = 0.01

ISO-8859-7 confidence = 0

windows-1253 confidence = 0

ISO-8859-5 confidence = 0.17484312375564148

windows-1251 not active

ISO-8859-2 confidence = 0.01

windows-1250 confidence = 0.01

TIS-620 confidence = 0.4613092462002095

windows-1255 confidence = 0

windows-1255 confidence = 0.01

windows-1255 confidence = 0.01

windows-1251 confidence = 0

KOI8-R confidence = 0

ISO-8859-5 confidence = 0.17355247990093067

MacCyrillic confidence = 0.01

IBM866 confidence = 0.01

IBM855 confidence = 0.01

ISO-8859-7 confidence = 0

windows-1253 confidence = 0

ISO-8859-5 confidence = 0.17484312375564148

windows-1251 not active

ISO-8859-2 confidence = 0.01

windows-1250 confidence = 0.01

TIS-620 confidence = 0.4613092462002095

windows-1255 confidence = 0

windows-1255 confidence = 0.01

windows-1255 confidence = 0.01

TIS-620 confidence 0.4613092462002095
windows-1252 confidence 0.95
{ encoding: 'windows-1252', confidence: 0.95 }

Clarify License

Can you include a LICENSE file in this repository, similar to the one in https://github.com/chardet/chardet/blob/master/LICENSE

Regression in V2: EUC-KR string ("ÇÑ±¹¾î") is incorrectly detected as 'windows-1252'; works in V1-4-1.

Hi everyone, I hope you are well and thank you for this excellent work. We are looking at using this project to auto-correct filenames in archives (e.g., TAR) that do not support unicode.

Here's what should be a very simple real-world example. The issue repros for me in 2-1-1 and 2-2-1

TEST STRING:
ÇÑ±¹¾î.txt

EXPECTED (V1-4-1 output):
{ encoding: "EUC-KR", confidence: 0.99 }

ACTUAL:
{ encoding: "windows-1252", confidence: 0.95 }

MORE INFO:
Works as expected in 1-4-1, for example, in the demo fiddle:
https://jsfiddle.net/vbogvqa8/

Issue repros even when removing the '.txt' extension.

I tried many different approaches, including:
jschardet.detect('\xc7\xd1\xb1\xb9\xbe\xee\x2e')

I made sure to carefully check the input values of the detection string:

adding 199, Ç
adding 209, Ñ
adding 177, ±
adding 185, ¹
adding 190, ¾
adding 238, î
adding 46, .
adding 116, t
adding 120, x
adding 116, t

The debug log indicates early failure of EUC-KR confidence = 0.01, I'm wondering if we're hitting some hard-coded heuristic there (maybe minimum string length or something?).

DEBUG LOG

SHIFT_JIS prober hit error at byte 5

jschardet2-2-1.min.js:631 EUC-TW prober hit error at byte 2

jschardet2-2-1.min.js:155 UTF-8 not active

jschardet2-2-1.min.js:155 SHIFT_JIS not active

jschardet2-2-1.min.js:155 EUC-JP confidence = 0.01

jschardet2-2-1.min.js:155 GB2312 confidence = 0.01

jschardet2-2-1.min.js:155 EUC-KR confidence = 0.01

jschardet2-2-1.min.js:155 Big5 confidence = 0.01

jschardet2-2-1.min.js:155 EUC-TW not active

jschardet2-2-1.min.js:155 UTF-8 not active

jschardet2-2-1.min.js:155 SHIFT_JIS not active

jschardet2-2-1.min.js:155 EUC-JP confidence = 0.01

jschardet2-2-1.min.js:155 GB2312 confidence = 0.01

jschardet2-2-1.min.js:155 EUC-KR confidence = 0.01

jschardet2-2-1.min.js:155 Big5 confidence = 0.01

jschardet2-2-1.min.js:155 EUC-TW not active

jschardet2-2-1.min.js:661 EUC-JP confidence 0.01
jschardet2-2-1.min.js:155 windows-1251 confidence = 0

jschardet2-2-1.min.js:155 KOI8-R confidence = 0

jschardet2-2-1.min.js:155 ISO-8859-5 confidence = 0

jschardet2-2-1.min.js:155 MacCyrillic confidence = 0.01

jschardet2-2-1.min.js:155 IBM866 confidence = 0.01

jschardet2-2-1.min.js:155 IBM855 confidence = 0

jschardet2-2-1.min.js:155 ISO-8859-7 confidence = 0

jschardet2-2-1.min.js:155 windows-1253 confidence = 0

jschardet2-2-1.min.js:155 ISO-8859-5 confidence = 0

jschardet2-2-1.min.js:155 windows-1251 confidence = 0

jschardet2-2-1.min.js:155 ISO-8859-2 confidence = 0

jschardet2-2-1.min.js:155 windows-1250 confidence = 0

jschardet2-2-1.min.js:155 TIS-620 confidence = 0.22488825752260216

jschardet2-2-1.min.js:155 ISO-8859-8 confidence = 0

jschardet2-2-1.min.js:155 windows-1251 confidence = 0

jschardet2-2-1.min.js:155 KOI8-R confidence = 0

jschardet2-2-1.min.js:155 ISO-8859-5 confidence = 0

jschardet2-2-1.min.js:155 MacCyrillic confidence = 0.01

jschardet2-2-1.min.js:155 IBM866 confidence = 0.01

jschardet2-2-1.min.js:155 IBM855 confidence = 0

jschardet2-2-1.min.js:155 ISO-8859-7 confidence = 0

jschardet2-2-1.min.js:155 windows-1253 confidence = 0

jschardet2-2-1.min.js:155 ISO-8859-5 confidence = 0

jschardet2-2-1.min.js:155 windows-1251 confidence = 0

jschardet2-2-1.min.js:155 ISO-8859-2 confidence = 0

jschardet2-2-1.min.js:155 windows-1250 confidence = 0

jschardet2-2-1.min.js:155 TIS-620 confidence = 0.22488825752260216

jschardet2-2-1.min.js:155 ISO-8859-8 confidence = 0

jschardet2-2-1.min.js:661 TIS-620 confidence 0.22488825752260216
jschardet2-2-1.min.js:661 windows-1252 confidence 0.95

Result of euc-kr is different from python chardet library

MY EUC-KR DATA

This file is encoded in EUC-KR, but it is detected as ISO-8859-2. However, chardet, the Python library, detects it correctly as EUC-KR.
