aadsm / jschardet

Character encoding auto-detection in JavaScript (port of python's chardet)

License: GNU Lesser General Public License v2.1

JavaScript 98.58% HTML 0.12% CSS 0.58% Shell 0.72%
character-encoding charset

jschardet's Introduction

JsChardet

Port of python's chardet (https://github.com/chardet/chardet).

License

LGPL

How To Use It

Node

npm install jschardet
var jschardet = require("jschardet")

// "àíàçã" in UTF-8
jschardet.detect("\xc3\xa0\xc3\xad\xc3\xa0\xc3\xa7\xc3\xa3")
// { encoding: "UTF-8", confidence: 0.9690625 }

// "次常用國字標準字體表" in Big5
jschardet.detect("\xa6\xb8\xb1\x60\xa5\xce\xb0\xea\xa6\x72\xbc\xd0\xb7\xc7\xa6\x72\xc5\xe9\xaa\xed")
// { encoding: "Big5", confidence: 0.99 }

// Martin Kühl
// jschardet.detectAll("\x3c\x73\x74\x72\x69\x6e\x67\x3e\x4d\x61\x72\x74\x69\x6e\x20\x4b\xfc\x68\x6c\x3c\x2f\x73\x74\x72\x69\x6e\x67\x3e")
// [
//   {encoding: "windows-1252", confidence: 0.95},
//   {encoding: "ISO-8859-2", confidence: 0.8796300205763055},
//   {encoding: "SHIFT_JIS", confidence: 0.01}
// ]

Browser

Copy and include jschardet.min.js in your web page.

This library is also available in cdnjs at https://cdnjs.cloudflare.com/ajax/libs/jschardet/1.4.1/jschardet.min.js

Options

// See all information related to the confidence levels of each encoding.
// This is useful to see why you're not getting the expected encoding.
jschardet.enableDebug();

// The default minimum accepted confidence level is 0.20, but sometimes this is
// not enough, especially when dealing with files that contain mostly numbers.
// Set it to 0 to always get a result, or to any other value that works
// for you.
jschardet.detect(str, { minimumThreshold: 0 });

// Restrict which encodings to detect. This can be useful when jschardet
// gives a higher probability to encodings that you never use.
jschardet.detect(str, { detectEncodings: ["UTF-8", "windows-1252"] });

Supported Charsets

  • Big5, GB2312/GB18030, EUC-TW, HZ-GB-2312, and ISO-2022-CN (Traditional and Simplified Chinese)
  • EUC-JP, SHIFT_JIS, and ISO-2022-JP (Japanese)
  • EUC-KR and ISO-2022-KR (Korean)
  • KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, and windows-1251 (Russian)
  • ISO-8859-2 and windows-1250 (Hungarian)
  • ISO-8859-5 and windows-1251 (Bulgarian)
  • windows-1252
  • ISO-8859-7 and windows-1253 (Greek)
  • ISO-8859-8 and windows-1255 (Visual and Logical Hebrew)
  • TIS-620 (Thai)
  • UTF-32 BE, LE, 3412-ordered, or 2143-ordered (with a BOM)
  • UTF-16 BE or LE (with a BOM)
  • UTF-8 (with or without a BOM)
  • ASCII

Technical Information

I haven't been able to create tests to correctly detect:

  • ISO-2022-CN
  • windows-1250 in Hungarian
  • windows-1251 in Bulgarian
  • windows-1253 in Greek
  • EUC-CN

Development

Use npm run dist to update the distribution files. They're available at https://github.com/aadsm/jschardet/tree/master/dist.

Authors

Ported from python to JavaScript by António Afonso (https://github.com/aadsm/jschardet)

Transformed into an npm package by Markus Ast (https://github.com/brainafk)

jschardet's People

Contributors

aadsm, antoineaugusti, blackpr, bpasero, danielgindi, dfoody, gyzerok, idealhack, jdesboeufs, kmalone75, lingsamuel, rkusa, the-compiler, yutotnh, zachasme

jschardet's Issues

big5 incorrectly detected as windows-1252

Detection is correct for your Big5 test case "次常用國字標準字體表",
but when I use "你好" encoded in Big5 (str = "\xA7\x41\xA6\x6E"), it is detected as { encoding: 'windows-1252', confidence: 0.95 }.

Usage instructions in browser

I didn't see any instructions for using this library with a file read from the browser.

I tried using the File API to read a file as text and then detect the encoding. The guessed encoding always seems to be either ascii or windows-1251. I suspect that when reading the file as text it is automatically decoded when reading using the File API (I'm not familiar enough with unicode in the browser to be sure).

I noticed that the tests all use hard-coded strings instead of files. Is there a recommended method for using this library with files in the browser? Thanks! 😄
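
One way to sidestep the automatic decoding suspected above is to read the file as raw bytes and build a "binary" string (one character per byte), which is the form the README examples pass to detect. The following is a minimal sketch under that assumption; `bytesToBinaryString` is a hypothetical helper, not part of jschardet, and the File API wiring is shown only as a comment.

```javascript
// Hypothetical helper (not part of jschardet): turn raw bytes into a
// "binary" string in which each character code equals one byte value.
function bytesToBinaryString(bytes) {
  const CHUNK = 0x8000; // convert in chunks to keep argument lists small
  let out = "";
  for (let i = 0; i < bytes.length; i += CHUNK) {
    out += String.fromCharCode.apply(null, bytes.subarray(i, i + CHUNK));
  }
  return out;
}

// In a browser, the bytes could come from the File API, for example:
//   const reader = new FileReader();
//   reader.onload = () => {
//     const bytes = new Uint8Array(reader.result);
//     console.log(jschardet.detect(bytesToBinaryString(bytes)));
//   };
//   reader.readAsArrayBuffer(file);
```

Reading with readAsArrayBuffer rather than readAsText keeps the original bytes intact, so the detector sees the file's encoding instead of an already decoded string.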

LGPL is a license hard to work with

We are using jschardet in a corporate environment where LGPL licensed software can't be used.

We are getting errors because external-editor pulls in jschardet as a dependency rather than a dev-dependency. Could jschardet be released under MIT, if possible?

Unicode character problem

Every message that uses the character ç next to another Unicode character comes out with wrong characters.

Using encoding: UTF-8

çã shows as згo
çõ shows as уш

This can only be reproduced when the message is sent from IRC to Discord, where the IRC side may not be UTF-8.

reactiflux/discord-irc#399

"Cannot find module" after Electron build.

Your library is great, but there is an issue: when I start the app packaged with Electron, I get an error message. I had a similar problem with other libraries such as JSZip and js-xlsx, and I found a way to use them in Electron builds, but unfortunately I had no luck with jschardet.
Could you please have a look? Thanks!

Attached is a minimal sample project that reproduces the problem:
sample-ko-with-electron-packager.zip

Uncaught Exception:
Error: Cannot find module './src'
at Module._resolveFilename (module.js:470:15)
at Function.Module._resolveFilename (/Users/dev/Desktop/App/releases/App-darwin-x64/App.app/Contents/Resources/electron.asar/common/reset-search-paths.js:35:12)
at Function.Module._load (module.js:418:25)
at Module.require (module.js:498:17)
at require (internal/module.js:20:19)
at Object.<anonymous> (/Users/dev/Desktop/App/releases/App-darwin-x64/App.app/Contents/Resources/app/node_modules/jschardet/index.js:1:173)
at Object.<anonymous> (/Users/dev/Desktop/App/releases/App-darwin-x64/App.app/Contents/Resources/app/node_modules/jschardet/index.js:2:3)
at Module._compile (module.js:571:32)
at Object.Module._extensions..js (module.js:580:10)
at Module.load (module.js:488:32)

`utf8prober` confidence function magic number "6" breaks short UTF-8 detection.

// src.utf8prober.js
this.getConfidence = function() {
    var unlike = 0.99;
    if( this._mNumOfMBChar < 6 ) {
        for( var i = 0; i < this._mNumOfMBChar; i++ ) {
            unlike *= ONE_CHAR_PROB;
        }
        return 1 - unlike;
    } else {
        return unlike;
    }
}

Because of this magic number, the confidence for UTF-8 text with fewer than 6 multibyte characters is too low to beat the other probers.

A simple fix would be to add a check on the ratio of multibyte characters.
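
To make the effect concrete, the snippet quoted above can be restated as a pure function and evaluated for small counts. This is only a sketch: the value 0.5 for ONE_CHAR_PROB is an assumption, chosen because it reproduces the 0.9690625 confidence the README reports for the five-character "àíàçã" example.

```javascript
// Restatement of the quoted getConfidence logic as a standalone function.
// oneCharProb = 0.5 is an assumption (see the note above), not a value
// taken from the quoted snippet.
function utf8Confidence(numMultibyteChars, oneCharProb = 0.5) {
  let unlike = 0.99;
  if (numMultibyteChars < 6) {
    for (let i = 0; i < numMultibyteChars; i++) {
      unlike *= oneCharProb;
    }
    return 1 - unlike;
  }
  return unlike;
}

// Print the confidence for 0 to 6 multibyte characters.
for (let n = 0; n <= 6; n++) {
  console.log(n, utf8Confidence(n));
}
```

Under that assumption, one to four multibyte characters give roughly 0.51, 0.75, 0.88 and 0.94, all below the 0.95 that windows-1252 reports in several of the issues on this page.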

Support converting from one encoding to another

I run a Node.js CMS called DocPad. Currently we only support UTF-8; however, Chinese users have noticed that DocPad does not work with GBK encoding.

It seems that jschardet can detect the encoding reasonably well. However, Node.js does not seem to support these encodings when converting a buffer to a string; it only supports the encodings listed here: http://nodejs.org/docs/latest/api/all.html#all_buffer

This makes me feel that the solution for us would be to detect the encoding with jschardet, then use something to convert that encoding to UTF8 or UTF16 - do you have any ideas on how this could be accomplished? And whether or not jschardet would be the correct project to handle such conversion?

Update: seems codes handles the conversion part, but not detection...
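
For the conversion step, one option in current runtimes is the WHATWG TextDecoder API, which takes an encoding label and turns bytes into a JavaScript string. The sketch below is not tied to jschardet; `decodeBytes` and `toUtf8Bytes` are hypothetical helper names, and which labels are accepted (for example "gbk") depends on the runtime, so only UTF-8 is exercised here.

```javascript
// Hypothetical helpers, not part of jschardet.

// Decode raw bytes using an encoding name, such as one reported by a detector.
// Throws a RangeError if the runtime does not know the label.
function decodeBytes(bytes, encodingName) {
  return new TextDecoder(encodingName).decode(bytes);
}

// Re-encode bytes from any supported encoding as UTF-8 bytes.
function toUtf8Bytes(bytes, encodingName) {
  return new TextEncoder().encode(decodeBytes(bytes, encodingName));
}

// "àíàçã" in UTF-8, the same bytes as the README example.
const sampleBytes = new Uint8Array([0xc3, 0xa0, 0xc3, 0xad, 0xc3, 0xa0, 0xc3, 0xa7, 0xc3, 0xa3]);
console.log(decodeBytes(sampleBytes, "UTF-8"));
```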

Use ECMAScript modules instead of commonjs/amd

Hello!

We use this package in our Angular 11 project. During the build we get a warning:

FILE_NAME_HERE depends on 'jschardet'. CommonJS or AMD dependencies can cause optimization bailouts.
For more info see: https://angular.io/guide/build#configuring-commonjs-dependencies

The recommendation is as follows:

It is recommended that you avoid depending on CommonJS modules in your Angular applications. Depending on CommonJS modules can prevent bundlers and minifiers from optimizing your application, which results in larger bundle sizes. Instead, it is recommended that you use ECMAScript modules in your entire application. For more information, see How CommonJS is making your bundles larger.


Would it be possible to do this? It would make our application smaller, and I believe it is also a more modern approach!

I am looking forward to your reply!

Detects ascii instead of utf-8

This data passed as a buffer results in ascii with confidence 1

It's the first stream portion from http://www.theverge.com/rss/frontpage

<!doctype html>

<!--[if lte IE 8]>   <html class="ie8 no-js">           <![endif]-->
<!--[if IE 9]>       <html class="ie9 no-js">           <![endif]-->
<!--[if gte IE 10]>  <html class="ie10 no-js">          <![endif]-->
<!--[if !IE]><!-->   <html lang="en-US" class="no-js">  <!--<![endif]-->

<head data-network="verge">

<script type="text/javascript">window.NREUM||(NREUM={});NREUM.info={"beacon":"beacon-5.newrelic.com","errorBeacon":"bam.nr-data.net","licenseKey":"e425f33c7f","applicationID":"754272","transactionName":"IVtWTBAMDVlXQh9HABBTXWcKFgNqQl9DRRZNR1BXFQ==","queueTime":8,"applicationTime":295,"ttGuid":"","agentToken":null,"agent":"js-agent.newrelic.com/nr-411.min.js"}</script>
<script type="text/javascript">window.NREUM||(NREUM={}),__nr_require=function(t,n,e){function r(e){if(!n[e]){var o=n[e]={exports:{}};t[e][0].call(o.exports,function(n){var o=t[e][1][n];return r(o?o:n)},o,o.exports)}return n[e].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<e.length;o++)r(e[o]);retur

SJISDistributionAnalysis misses half of SJIS characters

In chardistribution.js line 246 through 248:

if( aStr.charCodeAt(1) > 0x7F ) {
    order = -1;
}

According to the SJIS encoding schemes, Microsoft's DBCS programming guide, and the SJIS encoding table, the first-byte ranges of interest are 0x81 -- 0x9f and 0xe0 -- 0xef (the comment in line 235 is incorrect; the code in line 240 is correct), and the second-byte ranges are 0x40 -- 0x7e and 0x80 -- 0xfc (the comment in line 236 is incorrect; it should read 0x40 -- 0x7e, 0x80 -- 0xfc).

The code for the frequent Japanese character "の" is 0x82 0xcc. Since the second byte 0xcc (aStr.charCodeAt(1)) is greater than 0x7F, line 246 sets order = -1, so the character is missed. Its order should be 328.

Lines 246 through 248 should look like this:

if( aStr.charCodeAt(1) < 0x40 || aStr.charCodeAt(1) === 0x7F || aStr.charCodeAt(1) > 0xFC) {
    order = -1;
}
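
The proposed condition can be checked in isolation. The sketch below expresses the same second-byte ranges as a predicate; `isValidSjisTrailByte` is a hypothetical name used only for illustration and does not exist in chardistribution.js.

```javascript
// Hypothetical predicate mirroring the proposed fix: a Shift_JIS second
// byte is accepted when it lies in 0x40-0x7e or 0x80-0xfc.
function isValidSjisTrailByte(byte) {
  return byte >= 0x40 && byte <= 0xfc && byte !== 0x7f;
}

// Second byte of "の" (0x82 0xcc), which the current check rejects.
console.log(isValidSjisTrailByte(0xcc));
// Bytes outside the ranges named in the report.
console.log(isValidSjisTrailByte(0x3f), isValidSjisTrailByte(0x7f), isValidSjisTrailByte(0xfd));
```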

Cannot use jschardet from browser using jschardet.min.js

Hello, thanks for jschardet!

The readme instructions for browser usage suggest including jschardet.min.js in the webpage. This does not work as the file contains Node.js filesystem functions that fail, for example readFileSync.

Note that the cdnjs option for browser usage does not use Node.js and runs successfully.

can't detect SHIFT_JIS

I tried to use this to detect a SHIFT_JIS file, but it returns {codeType: null, confidence: 0}…
I put this file here

Support for stream

It would be great if encoding detection could accept a stream as input, try to detect the encoding block by block, and return a result as soon as a minimum confidence level (set through an option) is reached.
This could bring serious speed improvements for large buffers (for big web pages, encoding detection regularly takes longer than 1 second).

Any idea how we could do that?

Detect encoding by looking for specific markers

I was wondering if jschardet would ever consider reading specific markers within a file to determine the encoding. For example, XML can declare an encoding in the header:

<?xml version="1.0" encoding="windows-1251"?>

and HTML as well:

<meta charset="..."/>

There may be other languages where this exists too.

Refs: microsoft/vscode#36230
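
As an illustration of the idea, a caller could look for these two markers before falling back to statistical detection. The sketch below is an assumption about how that might be done, not existing jschardet behavior; `sniffDeclaredEncoding` is a hypothetical name, and the approach only works when the start of the file is readable as ASCII-compatible text.

```javascript
// Hypothetical helper: return the encoding declared in an XML prolog or an
// HTML <meta> tag near the start of the text, or null if none is found.
function sniffDeclaredEncoding(text) {
  const head = text.slice(0, 1024); // declarations appear near the top
  const xml = /^\s*<\?xml[^>]*\bencoding\s*=\s*["']([A-Za-z0-9._-]+)["']/i.exec(head);
  if (xml) {
    return xml[1];
  }
  // Matches both <meta charset="..."> and the older
  // <meta http-equiv="Content-Type" content="text/html; charset=...">.
  const meta = /<meta[^>]*\bcharset\s*=\s*["']?([A-Za-z0-9._-]+)/i.exec(head);
  return meta ? meta[1] : null;
}

console.log(sniffDeclaredEncoding('<?xml version="1.0" encoding="windows-1251"?>'));
console.log(sniffDeclaredEncoding('<html><head><meta charset="utf-8"/></head></html>'));
```

A declared encoding can be wrong, so a caller might still want to compare it against the detector's result rather than trust it blindly.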

UTF8 isn't recognized correctly

Hi,

I'd like to use jschardet for exactly what the original chardet was built for: parsing feeds.

Unfortunately, detection fails on German letters encoded in UTF-8:

var L=new Buffer([0xc3,0xa4]).toString("utf8");
console.log(jschardet.detect(L));

gives { encoding: 'windows-1252', confidence: 0.95 }

But 0xc3a4 is "ä" in utf8.

Thanks Robert

unreliable detection - windows1250

windows-1250 is listed as Hungarian, but it is really Central European, so the text may also be Slovak or Czech, or possibly other languages. The proper name is "Central European". Accented characters to recognize include, for example, čČšŠťŤžŽéÉľĽ.

Found in VSCode, which uses this library: text saved as windows-1250 is detected on reopening as windows-1252, another windows-125* encoding, or even ISO-8859-2, depending on which subset of these non-basic characters the content contains.

cp1250 is not detected

It does not work with Romanian subtitle files. OpenSubtitles detects these files as "cp1250", jschardet detects the encoding as "windows-1252".

Wrong characters: ã þ º
Correct romanian special characters: ă Ă â Â î Î ş Ş ţ Ţ

Test file: http://dl.opensubtitles.org/en/download/file/1954820326.srt

I've tested with many more files, though. If I use iconv-lite with "cp1250" (instead of "windows-1252" as detected), the file is converted to "utf8" correctly.

feeding tiny buffers can cause incorrect detection

I know it's only a very minor thing, but I was looking at the implementation and it struck me.

If for example I feed the single character '\xEF', the code path that checks for BOMs will not find anything. If I then feed the 2 characters '\xBB\xBF' (completing the UTF-8 BOM), the BOM checking code path is skipped. If the detector is then closed, the UTF-8 BOM is detected as windows-1252 with 95% confidence...
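
One way to avoid this is to buffer the first few bytes across feed calls and only check for a BOM once enough input has arrived or the input has ended. The sketch below illustrates that idea in isolation; `BomSniffer` and the encoding names it returns are hypothetical and do not reflect how jschardet's detector is actually structured.

```javascript
// BOM signatures as binary strings. The UTF-32 entries come before UTF-16
// because the UTF-32 LE BOM starts with the UTF-16 LE BOM.
const BOMS = [
  ["UTF-8", "\xEF\xBB\xBF"],
  ["UTF-32LE", "\xFF\xFE\x00\x00"],
  ["UTF-32BE", "\x00\x00\xFE\xFF"],
  ["UTF-16LE", "\xFF\xFE"],
  ["UTF-16BE", "\xFE\xFF"],
];
const MAX_BOM_LENGTH = 4;

// Hypothetical helper: remembers the first four bytes no matter how the
// input is split across feed() calls.
class BomSniffer {
  constructor() {
    this.head = "";
  }

  feed(chunk) {
    if (this.head.length < MAX_BOM_LENGTH) {
      this.head += chunk.slice(0, MAX_BOM_LENGTH - this.head.length);
    }
  }

  // Call after the last feed(); returns an encoding name or null.
  result() {
    for (const [name, bom] of BOMS) {
      if (this.head.startsWith(bom)) {
        return name;
      }
    }
    return null;
  }
}

const sniffer = new BomSniffer();
sniffer.feed("\xEF");      // first byte of the UTF-8 BOM
sniffer.feed("\xBB\xBF");  // remaining two bytes
console.log(sniffer.result());
```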

vue-cli-service build fails with a compile error

5:32 Cannot find name 'Buffer'. Do you need to install type definitions for node? Try npm i @types/node and then add node to the types field in your tsconfig.
    3 |     confidence: number
    4 | }
  > 5 | export function detect(buffer: Buffer | string, options?: { minimumThreshold: number }): IDetectedMap;
      |                                ^
    6 |
    7 | export function enableDebug(): void;
    8 |

UTF-8 or utf-8, we need to make a decision~

Usually, when a file is detected as UTF-8, the lowercase name "utf-8" is returned.
But when a file is detected with a confidence of 1, the uppercase name "UTF-8" is returned.

I don't think returning two spellings of UTF-8 is a good choice. Could you unify them?
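
Until the names are unified, callers can compare encoding names without regard to case. A minimal sketch, with `sameEncoding` as a hypothetical helper name:

```javascript
// Hypothetical helper: compare two encoding names ignoring letter case.
function sameEncoding(a, b) {
  return a.toLowerCase() === b.toLowerCase();
}

console.log(sameEncoding("UTF-8", "utf-8"));
console.log(sameEncoding("UTF-8", "windows-1252"));
```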

Usage outside Node.js + current completion

Hi,

Thank you for your work on this port, it's most useful. Couple of questions:

  • Is it usable without Node.js? Looking at the code history, there seems to have been a non-Node.js version before.
  • What's its current completion status? (From the README, 5 charsets couldn't be tested; are these the only "missing" bits?)

Thank you.

TypeError: Cannot assign to read only property '1' of string ''

The error

ERROR TypeError: Cannot assign to read only property '1' of string '��'
    at SJISProber.feed (sjisprober.js:73)
    at MBCSGroupProber.CharSetGroupProber.feed (charsetgroupprober.js:69)
    at UniversalDetector.feed (universaldetector.js:156)
    at runUniversalDetector (index.js:52)
    at Object.push.rVdK.exports.detect (index.js:34)
    at FileReader.fileReader.onload (app.component.ts:21)
    at ZoneDelegate.invoke (zone-evergreen.js:364)
    at Object.onInvoke (core.js:28494)
    at ZoneDelegate.invoke (zone-evergreen.js:363)
    at Zone.runGuarded (zone-evergreen.js:133)

is thrown in

this._mLastChar[1] = aBuf[0];

in my Angular 11 project, when I select cp1252.txt in the input of my AppComponent

app.component.ts

import { Component } from '@angular/core';
import * as jschardet from 'jschardet';

@Component({
  selector: 'app-root',
  template: `
      <input type="file" id="file" (change)="decode($event)">
  `
})
export class AppComponent {
  decode(e: any) {
    const file = e.target.files[0];
    console.log(typeof file);
    const fileReader = new FileReader();
    fileReader.onload = function() {
      const array = new Uint8Array(fileReader.result as ArrayBuffer);
      let string = "";
      for (let i = 0; i < array.length; ++i) {
        string += String.fromCharCode(array[i]);
      }
      console.log(jschardet.detect(string));
    };
    fileReader.readAsArrayBuffer(file);
  }
}

I only have this issue in a larger Angular project. I tried to reproduce it in a new Angular project as minimal example, but the TypeError is not thrown in this project. I suspect that some (build) configuration is different between the two projects, but I can't think of what that might be. Do you have any idea?

Besides, what should

this._mLastChar[1] = aBuf[0];

actually do? this._mLastChar is a string. They are immutable in JavaScript, aren't they?
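
One plausible explanation, offered here as an assumption rather than something the report confirms, is strict mode: assigning to an index of a primitive string is silently ignored in sloppy mode but throws a TypeError in strict mode, and bundlers often compile dependencies as strict code. The sketch below shows the difference using the Function constructor, whose bodies are sloppy unless they opt in with "use strict".

```javascript
// Function-constructor bodies do not inherit strictness from the code that
// creates them, so the two functions differ only in the directive.
const assignSloppy = new Function("s", "s[1] = 'x'; return s;");
const assignStrict = new Function("s", "'use strict'; s[1] = 'x'; return s;");

// Sloppy mode: the assignment is ignored and the string is unchanged.
console.log(assignSloppy("ab")); // prints "ab"

// Strict mode: the same assignment throws a TypeError.
try {
  assignStrict("ab");
} catch (err) {
  console.log(err instanceof TypeError, err.message);
}
```

If this is the cause, the two Angular projects would differ in whether jschardet's source ends up evaluated as strict code.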


https://stackoverflow.com/q/68568242/1065654

gb2312 not ok

test web http://www.zgyb.cn/

var Crawler = require("crawler"),
    url = require('url'),
    levelup = require('levelup'),
    fs = require('fs'),
    db = levelup('./mydb');

var c = new Crawler({
    maxConnections: 10,
    callback: function (error, response, $) {
        if (error) {
            return;
        }

        // X-Safe-Firewall
        var aHds = "Content-Security-Policy,X-Webkit-CSP,X-Content-Security-Policy,X-Frame-Options,X-XSS-Protection,X-Content-Type-Options,Server,X-Powered-By".split(",");
        // Current URL and title
        console.log(response["uri"] + " " + $("title").text().trim());
    }
});

c.queue({
    userAgent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36",
    uri: "http://www.zgyb.cn/",
    encoding: "utf8",
    timeout: 2000
});

remove with statements

Currently jschardet cannot be used in JavaScript strict mode because of the with statements in the following files:
escsm.js
mbcssm.js

Also, line 372 in mbcssm.js should be "jschardet.UCS2BE_cls" instead of "UCS2BE_cls".

TextDecoder label incompatibility: 'ibm855' and 'maccyrillic' (should be x-mac-cyrillic?)

I've found two cases where the encoding name output by jschardet doesn't conform to what I think is effectively the de facto standard: the labels used by the browser TextDecoder/TextEncoder APIs.

TextDecoder Docs:
https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder/TextDecoder

This is the standard list of encodings and aliases:
https://developer.mozilla.org/en-US/docs/Web/API/Encoding_API/Encodings

And:
https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder/encoding

Official Spec:
https://encoding.spec.whatwg.org/#dom-textdecoder-encoding

jschardet outputs 'maccyrillic', but seems likely this is better called 'x-mac-cyrillic' which is officially documented here:
https://en.wikipedia.org/wiki/Mac_OS_Cyrillic_encoding

'x-mac-cyrillic' is the correct one per the TextDecoder standard.

Another tricky one is ibm855:
https://en.wikipedia.org/wiki/Code_page_855

This has no equivalent in TextDecoder, but this is noted as being in the ISO 8859 group, perhaps ISO-8859-2 (?).

Options:

  1. Change maccyrillic -> x-mac-cyrillic, its standard name
  2. Do nothing, have end users translate
  3. Lobby for maccyrillic to be an alias for x-mac-cyrillic (requires changes to every browser)

Recommend (1) and further discussion on recommended standard ISO mapping for ibm855.
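
Option 2 could look roughly like the sketch below, which translates detector output into a TextDecoder label on the caller's side. The names `toTextDecoderLabel`, `LABEL_OVERRIDES` and `NO_EQUIVALENT` are hypothetical; only the two encodings discussed in this report are special-cased, and other names are passed through in lowercase without checking whether TextDecoder accepts them.

```javascript
// Names that need translating to a TextDecoder label.
const LABEL_OVERRIDES = new Map([["maccyrillic", "x-mac-cyrillic"]]);
// Names with no TextDecoder equivalent, per the report above.
const NO_EQUIVALENT = new Set(["ibm855"]);

// Hypothetical helper: map a detected encoding name to a TextDecoder label,
// or null when no equivalent is known.
function toTextDecoderLabel(detectedName) {
  const key = detectedName.toLowerCase();
  if (NO_EQUIVALENT.has(key)) {
    return null;
  }
  return LABEL_OVERRIDES.get(key) || key;
}

console.log(toTextDecoderLabel("MacCyrillic"));
console.log(toTextDecoderLabel("IBM855"));
```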

UTF-8 encoding of Degree Symbol

The issue I'm having is because of the degree symbol:
UTF-8 \xc2\xb0
http://www.fileformat.info/info/unicode/char/b0/index.htm

Below, I include the boiled-down calls. My true testing data sample includes properly formatted XML; but through testing I found that having more and more text does not affect the confidence or output of the "jschardet.detect()" call.

With 1, 2, or 3 degree symbols, it detects as windows-1252 (which parses with an extra \xc2 for each, since it's supposed to be UTF-8)
jschardet.detect('\xc2\xb0');

With 4 degree symbols, it detects as EUC-KR
jschardet.detect('\xc2\xb0\xc2\xb0\xc2\xb0\xc2\xb0');

GB18030 encoded file incorrectly detected as gb2312

atom/encoding-selector#65

Steps to Reproduce

https://github.com/malice-plugins/yara/blob/17a4fc946febe8b002e285f591bcb21b92a99e9e/rules/userdb_panda.yar

  • Open in Atom
  • Select "Auto Detect" encoding,

Expected behavior: Detects the encoding of the file as GB18030.
iconv -f GB18030 -t UTF-8 userdb_panda.yar works

Actual behavior: Atom auto detects the encoding as gb2312, 'undefined encoding'

iconv fails to convert from GB2312, but works with GB18030:

iconv -f GB2312 -t UTF-8 userdb_panda.yar
iconv: illegal input sequence at position 29230

Reproduces how often: Always

GBK not detected in this case

EUC-TW prober hit error at byte 0

windows-1251 confidence = 0, below negative shortcut threshhold 0.05

UTF-8 not active

SHIFT_JIS confidence = 0.01

EUC-JP confidence = 0.6666666666666666

GB2312 confidence = 0

EUC-KR confidence = 0

Big5 confidence = 0

EUC-TW not active

UTF-8 not active

SHIFT_JIS confidence = 0.01

EUC-JP confidence = 0.6666666666666666

GB2312 confidence = 0

EUC-KR confidence = 0

Big5 confidence = 0

EUC-TW not active

EUC-JP confidence 0.6666666666666666
windows-1251 confidence = 0

KOI8-R confidence = 0

ISO-8859-5 confidence = 0.17355247990093067

MacCyrillic confidence = 0.01

IBM866 confidence = 0.01

IBM855 confidence = 0.01

ISO-8859-7 confidence = 0

windows-1253 confidence = 0

ISO-8859-5 confidence = 0.17484312375564148

windows-1251 not active

ISO-8859-2 confidence = 0.01

windows-1250 confidence = 0.01

TIS-620 confidence = 0.4613092462002095

windows-1255 confidence = 0

windows-1255 confidence = 0.01

windows-1255 confidence = 0.01

windows-1251 confidence = 0

KOI8-R confidence = 0

ISO-8859-5 confidence = 0.17355247990093067

MacCyrillic confidence = 0.01

IBM866 confidence = 0.01

IBM855 confidence = 0.01

ISO-8859-7 confidence = 0

windows-1253 confidence = 0

ISO-8859-5 confidence = 0.17484312375564148

windows-1251 not active

ISO-8859-2 confidence = 0.01

windows-1250 confidence = 0.01

TIS-620 confidence = 0.4613092462002095

windows-1255 confidence = 0

windows-1255 confidence = 0.01

windows-1255 confidence = 0.01

TIS-620 confidence 0.4613092462002095
windows-1252 confidence 0.95
{ encoding: 'windows-1252', confidence: 0.95 }

Regression in V2: EUC-KR string ("Çѱ¹¾î") is incorrectly detected as 'windows-1252'; works in V1-4-1.

Hi everyone, I hope you are well and thank you for this excellent work. We are looking at using this project to auto-correct filenames in archives (e.g., TAR) that do not support unicode.

Here's what should be a very simple real-world example. The issue repros for me in 2-1-1 and 2-2-1

TEST STRING:
Çѱ¹¾î.txt

EXPECTED (V1-4-1 output):
{ encoding: "EUC-KR", confidence: 0.99 }

ACTUAL:
{ encoding: "windows-1252", confidence: 0.95 }

MORE INFO:
Works as expected in 1-4-1, for example, in the demo fiddle:
https://jsfiddle.net/vbogvqa8/

Issue repros even when removing the '.txt' extension.

I tried many different ways, including:
jschardet.detect('\xc7\xd1\xb1\xb9\xbe\xee\x2e')

I carefully checked the input values passed to detect:

adding 199, Ç
adding 209, Ñ
adding 177, ±
adding 185, ¹
adding 190, ¾
adding 238, î
adding 46, .
adding 116, t
adding 120, x
adding 116, t

The debug log indicates early failure of EUC-KR confidence = 0.01, I'm wondering if we're hitting some hard-coded heuristic there (maybe minimum string length or something?).

DEBUG LOG

SHIFT_JIS prober hit error at byte 5

jschardet2-2-1.min.js:631 EUC-TW prober hit error at byte 2

jschardet2-2-1.min.js:155 UTF-8 not active

jschardet2-2-1.min.js:155 SHIFT_JIS not active

jschardet2-2-1.min.js:155 EUC-JP confidence = 0.01

jschardet2-2-1.min.js:155 GB2312 confidence = 0.01

jschardet2-2-1.min.js:155 EUC-KR confidence = 0.01

jschardet2-2-1.min.js:155 Big5 confidence = 0.01

jschardet2-2-1.min.js:155 EUC-TW not active

jschardet2-2-1.min.js:155 UTF-8 not active

jschardet2-2-1.min.js:155 SHIFT_JIS not active

jschardet2-2-1.min.js:155 EUC-JP confidence = 0.01

jschardet2-2-1.min.js:155 GB2312 confidence = 0.01

jschardet2-2-1.min.js:155 EUC-KR confidence = 0.01

jschardet2-2-1.min.js:155 Big5 confidence = 0.01

jschardet2-2-1.min.js:155 EUC-TW not active

jschardet2-2-1.min.js:661 EUC-JP confidence 0.01
jschardet2-2-1.min.js:155 windows-1251 confidence = 0

jschardet2-2-1.min.js:155 KOI8-R confidence = 0

jschardet2-2-1.min.js:155 ISO-8859-5 confidence = 0

jschardet2-2-1.min.js:155 MacCyrillic confidence = 0.01

jschardet2-2-1.min.js:155 IBM866 confidence = 0.01

jschardet2-2-1.min.js:155 IBM855 confidence = 0

jschardet2-2-1.min.js:155 ISO-8859-7 confidence = 0

jschardet2-2-1.min.js:155 windows-1253 confidence = 0

jschardet2-2-1.min.js:155 ISO-8859-5 confidence = 0

jschardet2-2-1.min.js:155 windows-1251 confidence = 0

jschardet2-2-1.min.js:155 ISO-8859-2 confidence = 0

jschardet2-2-1.min.js:155 windows-1250 confidence = 0

jschardet2-2-1.min.js:155 TIS-620 confidence = 0.22488825752260216

jschardet2-2-1.min.js:155 ISO-8859-8 confidence = 0

jschardet2-2-1.min.js:155 ISO-8859-8 confidence = 0

jschardet2-2-1.min.js:155 ISO-8859-8 confidence = 0

jschardet2-2-1.min.js:155 windows-1251 confidence = 0

jschardet2-2-1.min.js:155 KOI8-R confidence = 0

jschardet2-2-1.min.js:155 ISO-8859-5 confidence = 0

jschardet2-2-1.min.js:155 MacCyrillic confidence = 0.01

jschardet2-2-1.min.js:155 IBM866 confidence = 0.01

jschardet2-2-1.min.js:155 IBM855 confidence = 0

jschardet2-2-1.min.js:155 ISO-8859-7 confidence = 0

jschardet2-2-1.min.js:155 windows-1253 confidence = 0

jschardet2-2-1.min.js:155 ISO-8859-5 confidence = 0

jschardet2-2-1.min.js:155 windows-1251 confidence = 0

jschardet2-2-1.min.js:155 ISO-8859-2 confidence = 0

jschardet2-2-1.min.js:155 windows-1250 confidence = 0

jschardet2-2-1.min.js:155 TIS-620 confidence = 0.22488825752260216

jschardet2-2-1.min.js:155 ISO-8859-8 confidence = 0

jschardet2-2-1.min.js:155 ISO-8859-8 confidence = 0

jschardet2-2-1.min.js:155 ISO-8859-8 confidence = 0

jschardet2-2-1.min.js:661 TIS-620 confidence 0.22488825752260216
jschardet2-2-1.min.js:661 windows-1252 confidence 0.95
