
unorm's People

Contributors

aaronshaf, bendkt, jakechampion, mpcref, openben, phadej, sonnyp, walling


unorm's Issues

Compression on build

The udata variable obviously has size issues, already pointed out in #17, but I was thinking that pre-compressing it in the build phase could significantly cut its size. For example, using the following process to compress that variable's content:

  1. Stringify the JSON.
  2. Compress it with bzip2 (near the top in both speed and efficiency).
  3. Base64-encode it. This will help remove some redundancy.
  4. Include that as a string in the built source.

And then, on load or first use on the target side:

  1. Decode it.
  2. Decompress it. (should be significantly faster than compression)
  3. Parse the JSON.
  4. Ready to use.

NFC versus W3C canonicalization

Hello,

I have been looking into a canonical representation of user-edited UTF-8 strings. I was planning to use NFC but ran into this (http://www.unicode.org/faq/normalization.html):

Q: What is the difference between W3C normalization and Unicode normalization?
A: Unicode normalization comes in 4 flavors: C, D, KC, KD. It is C that is relevant for W3C normalization. W3C normalization also treats character references (&#nnnn;) as equivalent to characters. For example, the text string "a&#xnnnn;" (where nnnn = "0301") is Unicode-normalized, since it consists only of ASCII characters, but it is not W3C-normalized, since it contains a representation of a combining acute accent with "a", and in normalization form C, that should have been normalized to U+00E1. [JC]

Would it be easy for you / would it make sense to add a W3C normalization method directly in unorm?

Am I right to think that this normalization would give better results for browser-edited content (where users may be using WYSIWYG editors that emit the &#xnnnn; notation)?
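
As a rough illustration of what such a method could do, here is a sketch with a hypothetical `w3cNormalize` helper: it expands numeric character references and then applies NFC (using the native `normalize` as a stand-in for unorm's NFC). Real W3C normalization covers more cases than this.

```javascript
// Hypothetical sketch: expand numeric character references, then apply NFC.
function w3cNormalize(text) {
  const expanded = text.replace(/&#(?:x([0-9a-fA-F]+)|(\d+));/gi, (m, hex, dec) =>
    String.fromCodePoint(hex ? parseInt(hex, 16) : parseInt(dec, 10))
  );
  return expanded.normalize('NFC');
}
```

For the FAQ's example, `w3cNormalize('a&#x0301;')` yields U+00E1 ("á"), which plain NFC on the raw ASCII string would not.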

"Dual Licensing"

I have a question regarding the software licensing.

You specify that this software is dual-licensed (MIT and GPL). You explicitly write "and", not "or". Does this mean that users of the software have to fulfill the obligations of both licenses equally, or that they can choose between the MIT and the GPL?

It would be great if you could give me a short answer to the question, as I am currently clarifying whether we can use the software compliantly. This would not be possible at the moment, if we have to comply with the GPL at the same time.

Thanks for your answer and thanks for your work

Make it a true shim

If the JavaScript environment supports the String.prototype.normalize function, this library should use that and export it as the unorm functions (as a fallback for code that expects them).
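
A minimal sketch of that delegation, assuming a hypothetical pure-JS fallback named `jsNfc` (unorm's real internal names may differ):

```javascript
// Hypothetical fallback: the library's pure-JS NFC implementation would go
// here; the identity body is just a placeholder for this sketch.
function jsNfc(str) { /* pure-JS NFC implementation */ return str; }

// Prefer the engine's native normalize when it exists; otherwise fall back.
const hasNative = typeof String.prototype.normalize === 'function';
const unorm = {
  nfc: hasNative ? (str) => str.normalize('NFC') : jsNfc,
};
```

On a modern engine the native path is taken, so code that expects `unorm.nfc` keeps working while getting the fast built-in.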

Please tag 1.0.2

In the process of packaging etherpad-lite for Debian I have to package unorm as it is an etherpad-lite dependency.

For packaging I prefer relying on version tags in upstream Git.

Please tag version 1.0.2 (if appropriate) and also tag new releases with version numbers, so that the Debian QA site will notify the Debian Javascript Packaging team of new upstream releases of unorm.

Please also consider adding yourself to the list of copyright holders (if this meets reality).

Thanks+Greets,
Mike ([email protected])

Something wrong with your example.js

Hi, there is something wrong with your example.js, in which you use XRegExp. To get the XRegExp function you have to use require('xregexp').XRegExp...

Not able to do string normalization by using Unorm.js

In my Visual Studio 2008 ASP.NET project I tried to use unorm.js for string normalization. I added the unorm.js file to my project, used it on my web page, and wrote my own JavaScript function to normalize strings using the UNorm.normalize('NFC', str) function,

but an error occurs within the Unorm.js file's function fromData(next, cp, needfeature) as given below:

JavaScript runtime error: Unable to get property '768' of undefined or null reference

I tried following ways to normalize the string from my javascript code but no luck:

var strpwd = 'Ω';

1- nstr = UNorm.normalize('NFC',strpwd);

2- nstr = UNorm.nfc(strpwd);

Please suggest a solution.
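
One thing worth checking in a report like this: the 'Ω' in the snippet can be either GREEK CAPITAL LETTER OMEGA (U+03A9) or OHM SIGN (U+2126), which look identical. Using the native API as a reference for the result a correctly loaded unorm should produce:

```javascript
// NFC maps OHM SIGN (U+2126) to GREEK CAPITAL LETTER OMEGA (U+03A9);
// U+03A9 itself is already in NFC and passes through unchanged.
const ohm = '\u2126';
const nfc = ohm.normalize('NFC'); // '\u03A9'
```

The "Unable to get property '768' of undefined" error suggests the udata table inside unorm.js was not loaded intact (for example, a truncated or re-encoded copy of the file), rather than a problem with the input string itself.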

Look at library size

We could potentially compress udata better. I've been researching this a bit, and we could shave off a good number of bytes by changing the data layout and saving it in base-36 (which is fast for JavaScript to decode with parseInt).

I also think it's an issue that the code points are laid out in this binary format: yyyyyxxxxxxxxyyyyyyyy. This makes the x=0 section quite big, but much of the time you'd only use Latin-1 characters and no characters outside the BMP. A better format would be xxxxxxxxxxxxxyyyyyyyy. This creates more data rows, but you have to decompress less data on average, based on the assumption that normal text only revolves around a few Unicode scripts. Or maybe we should make a split between the way BMP and outside-BMP code points are stored.

I just need to look at my research files again and write the points of my research down in this issue.
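
The base-36 idea can be sketched in a few lines (the row format here is illustrative, not unorm's actual layout):

```javascript
// Code points serialize compactly with Number#toString(36) and decode
// quickly with parseInt(str, 36).
function encodeRow(codepoints) {
  return codepoints.map((cp) => cp.toString(36)).join(',');
}

function decodeRow(str) {
  return str.split(',').map((s) => parseInt(s, 36));
}
```

Base-36 digits stay within [0-9a-z], so the encoded rows embed directly in a JS string literal without escaping.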

unorm and streaming

Hello,

Have you considered bringing a streaming API to unorm? Do you think it is technically feasible (is normalization only a 'local' operation, where you only need to know a few characters before taking a decision), or do you strictly need the whole buffered string before starting to normalize?
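
Normalization is mostly local: a chunk can be safely flushed up to the last "starter" character. A rough sketch of the idea, holding back a trailing base character plus combining marks between chunks (a simplification; a real implementation would use the Quick_Check data to find safe boundaries, and `makeNormalizer` is a hypothetical name):

```javascript
// Chunked NFC normalization: flush the safe prefix of each chunk and carry
// the unsafe tail (last base char + trailing combining marks) forward.
function makeNormalizer() {
  let tail = '';
  return {
    push(chunk) {
      const buf = tail + chunk;
      // Find a final base character followed only by combining diacritics.
      const m = buf.match(/.[\u0300-\u036F]*$/u);
      const cut = m ? m.index : buf.length;
      tail = buf.slice(cut);
      return buf.slice(0, cut).normalize('NFC');
    },
    flush() {
      const out = tail.normalize('NFC');
      tail = '';
      return out;
    },
  };
}
```

The key point is that a combining mark arriving in the next chunk can still compose with the held-back base character, so the boundary never splits a composition.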

Do not apply polyfill by default

Hi,

In some applications String.prototype is sealed for security reasons.
In that context, defining a new property on String.prototype will throw.
It would be great if the polyfill were applied only when requested via an explicit argument, or if you could provide a way to use unorm with a sealed String.prototype.

Thanks
Regards
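
A guarded install could avoid the throw entirely. This is a sketch with a hypothetical `installPolyfill` entry point (unorm's real API may differ): it patches only when `normalize` is missing and the prototype is still extensible.

```javascript
// Install the polyfill only when it is both needed and possible; return
// whether the prototype was actually patched.
function installPolyfill(nfcImpl) {
  const proto = String.prototype;
  if (typeof proto.normalize === 'function') return false; // native exists
  if (!Object.isExtensible(proto)) return false;           // sealed/frozen
  Object.defineProperty(proto, 'normalize', {
    value: function (form) { return nfcImpl(String(this), form); },
    configurable: true,
    writable: true,
  });
  return true;
}
```

Callers who need normalization under a sealed prototype could then use the module-level functions directly instead of the prototype method.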

Ideas for optimizations

Just tossing some ideas out there.

The technical report outlines the Quick_Check algorithms for fast verification whether a given string is in a given normalization form: Detecting Normalization Forms.

These are YES / NO / MAYBE answers, so in theory it should be possible to implement this as two regexps:

  1. Whitelist regexp: If this regexp matches, the answer is YES, otherwise continue.
  2. Blacklist regexp: If this regexp matches, the answer is NO, otherwise MAYBE.

The regexps should be automatically generated. It is complicated by the fact that JavaScript regexps only support UCS-2 and not UTF-16, so we have to manually calculate surrogate pairs (and nested matching groups). See punycode.js.
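
The two-regexp check might look like this. The character classes below are tiny hand-picked samples (assumptions, not the generated classes): U+0340, U+0341, and U+2126 really have NFC_Quick_Check=No, and code points below U+0300 are all NFC_Quick_Check=Yes, but the real classes must be generated from the Unicode data files.

```javascript
// Whitelist: the whole string consists of NFC_Quick_Check=Yes characters.
const NFC_YES = /^[\u0000-\u02FF]*$/;
// Blacklist: the string contains an NFC_Quick_Check=No character.
const NFC_NO = /[\u0340\u0341\u2126]/;

function quickCheckNFC(str) {
  if (NFC_YES.test(str)) return 'YES';
  if (NFC_NO.test(str)) return 'NO';
  return 'MAYBE';
}
```

Only the MAYBE case would fall through to the full normalization-and-compare path.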

If implemented, we could add test functions, e.g. 'foo'.isNormalization('NFC'). Internally they could be used to speed up any given normalization. Something like this: use the whitelist regexp to match the longest prefix already in this normalization, then cut that out and normalize the rest recursively. We only need to normalize the parts that are not in the normalization already, but we have to be careful about the boundaries between normalized and non-normalized parts. Some more strategies are outlined in the technical report: Optimization Strategies.

First and foremost, we should make a benchmark test suite, to actually learn whether these optimizations give a speed boost for long strings. And it would be nice to know how much they add to the size of the library.
