bocu

A fast implementation of the MIME-compatible Binary Ordered Compression for Unicode (BOCU). Under 2 KB minified and gzipped.

Like SCSU, BOCU is designed to be useful for compressing short strings and does so by mapping runs of characters in the same small alphabet to single bytes, thus reducing Unicode text to a size comparable to that of legacy encodings, while retaining all the advantages of Unicode. Unlike SCSU, BOCU is safe for email, preserving linefeeds and other control codes.

I could not find any JavaScript implementations of BOCU, so I wrote this one. Its output is byte-for-byte identical to that of the C code. It has been tested on the entire Unicode range and in the major browsers.

Usage & Examples

sBocu = bocu.encode(sPlainText);
sPlainText = bocu.decode(sBocu);
bocu.encode('“Moscow” is Москва.'); // returns a binary string, one character per byte (several are unprintable)
// with bytes: F1 56 2E A0 BF C3 B3 BF C7 F1 57 20 2E BC C3 20 D3 D0 8E 91 8A 82 80 4B FA

bocu.encode('foo 𝌆 bar 𝟙𝟚𝟛😎 mañana mañana 🏳️‍🌈');  
// saved as utf-16: 84 bytes; utf-8: 61 bytes; deflate raw: 57 bytes; bocu1: 55 bytes
// benchmark for that string: bocu 664,117 ops/sec; gz deflate (Pako) 7,081 ops/sec
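
To inspect an encoded result as hex, as in the byte listing above, each character of the binary string can be mapped to its character code. This is a minimal sketch: `toHex` is a hypothetical helper, not part of bocu, and it assumes `bocu.encode` returns one character per byte, as the example above suggests.

```javascript
// Hypothetical helper (not part of bocu): render a binary string as hex bytes.
// Assumes every character holds exactly one byte (code 0..255).
function toHex(binaryString) {
  return Array.from(binaryString, (ch) =>
    ch.charCodeAt(0).toString(16).toUpperCase().padStart(2, '0')
  ).join(' ');
}

// The first four bytes of the example above, written with escapes.
console.log(toHex('\xF1V.\xA0')); // F1 56 2E A0

// With the library loaded, the full listing could be produced like this.
if (typeof bocu !== 'undefined') {
  console.log(toHex(bocu.encode('“Moscow” is Москва.')));
}
```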

BOCU 'compression' won't do any better than utf-8 on simple English, where both use one byte per character. Its benefit is with other scripts, which take multiple bytes per character in standard encodings like utf-8: the first character in a line requires multiple bytes, and subsequent characters within the same small script take only one byte each. The massive speed difference between bocu and deflate holds only for small strings, but that is where BOCU and SCSU are useful (for instance, saving individual strings into a database). On Firefox, bocu is faster than a simple utf-8 conversion using s = unescape(encodeURIComponent(s)); on Chrome, the utf-8 conversion is a couple of times faster.
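
The size behaviour described here can be sketched with a small byte counter. This is not the library's code: `bocu1Prev` and `bocu1ByteLength` are illustrative names, the function only counts bytes rather than encoding, and the difference ranges and "previous" rules are my reading of the BOCU-1 specification (Unicode Technical Note #6).

```javascript
// Illustrative sketch only: counts BOCU-1 bytes, does not produce them.

// State kept after each character: roughly the middle of its script block.
function bocu1Prev(c) {
  if (c >= 0x3040 && c <= 0x309f) return 0x3070; // Hiragana
  if (c >= 0x4e00 && c <= 0x9fa5) return 0x7711; // CJK Unihan
  if (c >= 0xac00 && c <= 0xd7a3) return 0xc1d1; // Hangul
  return (c & ~0x7f) + 0x40; // middle of the 128-code-point block
}

function bocu1ByteLength(str) {
  let prev = 0x40; // initial state: middle of the ASCII block
  let total = 0;
  for (const ch of str) { // iterates by code point, not UTF-16 unit
    const c = ch.codePointAt(0);
    if (c <= 0x20) {
      // Controls and space are stored as themselves; space keeps the state.
      if (c !== 0x20) prev = 0x40;
      total += 1;
      continue;
    }
    const diff = c - prev;
    prev = bocu1Prev(c);
    if (diff >= -64 && diff <= 63) total += 1;
    else if (diff >= -10513 && diff <= 10512) total += 2;
    else if (diff >= -187660 && diff <= 187659) total += 3;
    else total += 4;
  }
  return total;
}

console.log(bocu1ByteLength('alpha')); // 5, same as utf-8
console.log(bocu1ByteLength('Москва')); // 7: two bytes for М, then one each
console.log(new TextEncoder().encode('Москва').length); // 12 in utf-8
```

For the '“Moscow” is Москва.' example above, the counter gives 25, matching the length of the byte listing.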

// note that encoded strings still sort in the same order as the originals
bocu.encode('alpha'); // ±¼À¸±
bocu.encode('beta');  // ²µÄ± 
bocu.encode('gamma'); // ·±½½± 

bocu.encode('άλφα');  // d3 60 8b 96 81
bocu.encode('βήτα');  // d3 66 7e 94 81
bocu.encode('γάμμα'); // d3 67 7c 8c 8c 81
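
The ordering property can be checked with a plain byte-wise comparison. The sketch below uses the encoded bytes copied from the Greek examples above, so it runs without the library; `compareBytes` is a hypothetical helper, not part of bocu.

```javascript
// Hypothetical helper: byte-wise comparison, shorter prefix sorts first.
function compareBytes(a, b) {
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    if (a[i] !== b[i]) return a[i] - b[i];
  }
  return a.length - b.length;
}

// Encoded bytes copied from the Greek examples above, deliberately unsorted.
const encoded = [
  { word: 'γάμμα', bytes: [0xd3, 0x67, 0x7c, 0x8c, 0x8c, 0x81] },
  { word: 'άλφα', bytes: [0xd3, 0x60, 0x8b, 0x96, 0x81] },
  { word: 'βήτα', bytes: [0xd3, 0x66, 0x7e, 0x94, 0x81] },
];

// Sort by the encoded bytes only, then read back the original words.
const sortedWords = encoded
  .slice()
  .sort((x, y) => compareBytes(x.bytes, y.bytes))
  .map((x) => x.word);

console.log(sortedWords.join(', ')); // άλφα, βήτα, γάμμα
```

The order preserved is Unicode code point order. JavaScript's default string comparison works on UTF-16 code units instead, so the two can differ for characters outside the Basic Multilingual Plane.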

Notes

  • This works as is in a modern browser: <script src="bocu.js"></script>. It uses ES6 features such as arrow functions and the spread operator. To support older browsers, minify with something like the Google Closure Compiler in Simple mode, which currently polyfills to ES5, or specify @language_out ES3, or ES6 for no polyfill.

  • I've ported the core parts of the C code (not the test module) and added a wrapper to encode and decode a string. The only minor change to the core is that it no longer includes the number of bytes used in the lead byte of the returned integer (which is not stored in the encoding anyway); it simply works out how many bytes the returned integer takes. The C code also allows BOCU to be customised into a non-standard form that uses fewer byte values, which requires conditional compilation (#if BOCU1_MAX_TRAIL...) that JS can't do natively. That small bit of conditional code has been commented out, but could be added back in for those unusual cases.

  • I have not found any BOCU-1 files to test with; I can translate C but can't program in it. The C program is available and can produce BOCU-1 encoded files. When testing such files by reading them with FileReader, open them as binary, not text, or FileReader will get the encoding wrong.
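
Reading a BOCU-1 file as binary, as the last note advises, might look like the sketch below. `bytesToBinaryString` and `readBocuFile` are hypothetical helpers, not part of bocu. The sketch assumes `bocu.decode` accepts a binary string with one character per byte, and that `file` is a File object, for example from an `<input type="file">` element.

```javascript
// Hypothetical helper (not part of bocu): raw bytes to a binary string,
// one character per byte, the form bocu.decode is assumed to accept.
function bytesToBinaryString(bytes) {
  let s = '';
  for (const b of bytes) s += String.fromCharCode(b);
  return s;
}

// Browser only: read the file as raw bytes, never as text, so that no
// character-set decoding is applied before bocu sees the data.
function readBocuFile(file, onText) {
  const reader = new FileReader();
  reader.onload = () => {
    const bytes = new Uint8Array(reader.result);
    onText(bocu.decode(bytesToBinaryString(bytes)));
  };
  reader.readAsArrayBuffer(file);
}
```

Calling `readBocuFile(file, (text) => console.log(text))` would then log the decoded text once the file has loaded.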

BOCU Encoding References

Authors

Original implementation (in C):

License

MIT

Contributors

aamarks