walling / unorm
JavaScript Unicode 8.0 Normalization - NFC, NFD, NFKC, NFKD.
Home Page: http://git.io/unorm
License: Other
I'd like to include unorm in a module and use npm run unorm test to run the tests, but none of the tests are included in the npm package.
The udata variable obviously has size issues, already pointed out in #17, but I think pre-compressing it in the build phase could cut its size significantly. For example, using the following process to compress that variable's content:
And then, on load or first use on the target side:
unorm fails on reduceRight in IE8, which prevents all following code from running when unorm is part of a concatenated bundle.
cf https://developer.mozilla.org/fr/docs/Web/JavaScript/Reference/Objets_globaux/Array/reduceRight#Browser_compatibility
Hello,
I have been looking into a canonical representation of user-edited UTF-8 strings. I was planning to use NFC but ran into this (http://www.unicode.org/faq/normalization.html):
Q: What is the difference between W3C normalization and Unicode normalization?
A: Unicode normalization comes in 4 flavors: C, D, KC, KD. It is C that is relevant for W3C normalization. W3C normalization also treats character references (&#nnnn;) as equivalent to characters. For example, the text string "a&#xnnnn;" (where nnnn = "0301") is Unicode-normalized since it consists only of ASCII characters, but it is not W3C-normalized, since it contains a representation of a combining acute accent with "a", and in normalization form C, that should have been normalized to U+00E1.[JC]
Would it be easy for you, and would it make sense, to add a W3C normalization method directly in unorm?
Am I right to think that this normalization would give better results for browser-edited content (where users may be working in WYSIWYG editors that emit the &#xnnnn; notation)?
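A minimal sketch of what the question describes, assuming only numeric character references need expanding and that the native String.prototype.normalize is available (unorm.nfc could be substituted for it). The name w3cNormalize is hypothetical and not part of unorm, and this does not claim to cover everything the W3C rules require:

```javascript
// Hypothetical helper, not part of unorm: expand numeric character
// references such as &#x0301; or &#769;, then apply NFC.
function w3cNormalize(text) {
  const expanded = text.replace(
    /&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/g,
    (match, hex, dec) => {
      const cp = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
      // Leave out-of-range references untouched instead of throwing.
      return cp <= 0x10FFFF ? String.fromCodePoint(cp) : match;
    }
  );
  return expanded.normalize('NFC');
}

console.log(w3cNormalize('a&#x0301;') === '\u00e1');
```

With plain NFC the input "a&#x0301;" stays unchanged because it is pure ASCII; expanding the reference first is what lets the accent compose with the "a".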
I have a question regarding the software licensing.
You specify that the software is dual-licensed (MIT and GPL). You explicitly write "and", not "or". Does this mean that the user of the software has to fulfill the obligations of both licenses, or can the user choose between the MIT and the GPL?
It would be great if you could give me a short answer to the question, as I am currently clarifying whether we can use the software compliantly. This would not be possible at the moment, if we have to comply with the GPL at the same time.
Thanks for your answer and thanks for your work
Any plans to support browsers too? This would boil down to tweaking the export logic a bit.
P.S. I’ve made a short URL for this repository: http://git.io/unorm
If the JavaScript environment supports the String.prototype.normalize function, this library should use that and export it as the unorm functions (as a fallback for code that expects them).
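One way this could look, as a sketch rather than the library's actual export logic. createNormalizer and the fallback(form, str) signature are assumptions made for illustration; only the nfc/nfd/nfkc/nfkd names mirror the functions unorm exposes:

```javascript
// Hypothetical wrapper: prefer the native method when it exists, otherwise
// delegate to a supplied implementation (for example unorm's own code).
function createNormalizer(fallback) {
  const hasNative = typeof String.prototype.normalize === 'function';
  const make = (form) => (str) =>
    hasNative ? String(str).normalize(form) : fallback(form, String(str));
  return {
    nfc: make('NFC'),
    nfd: make('NFD'),
    nfkc: make('NFKC'),
    nfkd: make('NFKD')
  };
}

const normalizer = createNormalizer(() => {
  throw new Error('no fallback implementation supplied');
});
console.log(normalizer.nfc('a\u0301') === '\u00e1');
```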
Hi,
You created a great polyfill, but it cannot be used in some projects because of the license. Is it possible to change the license to MIT only?
In the process of packaging etherpad-lite for Debian I have to package unorm as it is an etherpad-lite dependency.
For packaging I prefer relying on version tags in upstream Git.
Please tag version 1.0.2 (if appropriate) and also tag each new release with a version number, so that the Debian QA site will notify the Debian Javascript Packaging team of new upstream releases of unorm.
Please also consider adding yourself to the list of copyright holders (if this reflects reality).
Thanks+Greets,
Mike ([email protected])
Hi, there is something wrong with your example.js, where you use XRegExp. To get the XRegExp function you have to use require('xregexp').XRegExp...
In my Visual Studio 2008 ASP.NET project I tried to use unorm.js for string normalization. I added the unorm.js file to my project, included it on my web page, and wrote my own JavaScript function that normalizes a string with UNorm.normalize('NFC', str).
However, an error occurs inside the unorm.js function fromData(next, cp, needfeature):
JavaScript runtime error: Unable to get property '768' of undefined or null reference
I tried the following ways to normalize the string from my JavaScript code, but without success:
var strpwd = 'Ω';
1- nstr = UNorm.normalize('NFC',strpwd);
2- nstr = UNorm.nfc(strpwd);
Please suggest a solution.
We could potentially compress the udata better. I've been researching this a bit, and we could shave off a good number of bytes by changing the data layout and saving it in base-36 (which is fast for JavaScript to decode with parseInt).
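To illustrate the base-36 part only, here is a small sketch of round-tripping integers through base-36 text. The helper names and the comma-separated layout are assumptions for the example, not the format proposed for udata:

```javascript
// Encode an array of non-negative integers as comma-separated base-36 text.
function encodeBase36(values) {
  return values.map((v) => v.toString(36)).join(',');
}

// Decode the text back into integers with parseInt(..., 36).
function decodeBase36(text) {
  return text === '' ? [] : text.split(',').map((s) => parseInt(s, 36));
}

const sample = [768, 65535, 0];
const packed = encodeBase36(sample);
console.log(packed, decodeBase36(packed));
```

Each value takes fewer characters than its decimal form once it exceeds 35, which is where the size saving would come from.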
I also think it's an issue that the code points are laid out in this binary format: yyyyyxxxxxxxxyyyyyyyy. This makes the x=0 section quite big, but often you'd only use Latin-1 characters and not characters outside the BMP. A better format would be xxxxxxxxxxxxxyyyyyyyy. This creates more data rows, but you have to decompress less data on average, based on the assumption that normal text only revolves around a few Unicode scripts. Or maybe we should split the way BMP and non-BMP code points are stored.
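A sketch of how the two bit layouts would group code points, assuming x selects a data row and y an entry within it (my reading of the description, not verified against the library's data format). The function names are hypothetical:

```javascript
// Current layout yyyyyxxxxxxxxyyyyyyyy: the row is taken from bits 8..15,
// so code points from every plane can land in the same row.
function currentRow(cp) {
  return (cp >>> 8) & 0xFF;
}

// Proposed layout xxxxxxxxxxxxxyyyyyyyy: the row is the high 13 bits,
// so each row covers one contiguous block of 256 code points.
function proposedRow(cp) {
  return cp >>> 8;
}

// U+0041 and U+10041 share row 0 today but would be split apart.
console.log(currentRow(0x41), currentRow(0x10041));
console.log(proposedRow(0x41), proposedRow(0x10041));
```

Under the proposed layout, text that stays within Latin-1 would only ever touch row 0, which then contains nothing from outside the BMP.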
I just need to look at my research files again and write the points of my research down in this issue.
https://github.com/walling/unorm/blob/master/test/normalization.js#L76
It reads tests.lineNumbergth, but I'm pretty sure it should be tests.length.
Hello,
This is a minor issue.
In the unorm.js file, there is an unexpected trailing comma after the last property of the unorm object. It is at line 383 in the current version on the master branch:
https://github.com/walling/unorm/blob/master/lib/unorm.js
Thanks :)
... so it can be used more easily in the browser.
The shim doesn't work with Node 0.11: https://travis-ci.org/walling/unorm/jobs/35543562
Hello,
Have you considered adding a streaming API to unorm? Do you think it is technically feasible (is normalization a "local" operation, where you only need to see a few characters before making a decision), or do you strictly need the whole string before you can start normalizing?
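A small illustration of why chunk boundaries matter for this question, using the native String.prototype.normalize as a stand-in for unorm. Normalizing each chunk on its own gives a different result from normalizing the whole string when a combining mark arrives in a later chunk than its base character:

```javascript
// The combining acute accent (U+0301) arrives separately from its base "e".
const chunks = ['caf', 'e', '\u0301'];

// Normalizing chunk by chunk leaves the accent uncomposed.
const perChunk = chunks.map((c) => c.normalize('NFC')).join('');

// Normalizing the joined string composes "e" + U+0301 into U+00E9.
const whole = chunks.join('').normalize('NFC');

console.log(perChunk.length, whole.length, perChunk === whole);
```

A streaming API would therefore have to hold back the tail of each chunk until it has seen enough of the following input to know that nothing more can combine with it.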
Hi,
In some applications String.prototype is sealed for security reasons.
In that context, defining a new property on String.prototype will throw.
It would be great if the polyfill were applied only when explicitly requested, or if you could provide a way to use unorm with a sealed String.prototype.
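A sketch of an opt-in installer that skips sealed targets instead of throwing. installNormalize and its arguments are hypothetical, not part of unorm; the example uses a plain sealed object as a stand-in for a sealed String.prototype:

```javascript
// Hypothetical opt-in installer; `impl` stands in for the polyfill function.
function installNormalize(proto, impl) {
  if (typeof proto.normalize === 'function') return false; // already present
  if (!Object.isExtensible(proto)) return false;           // sealed or frozen
  Object.defineProperty(proto, 'normalize', {
    value: impl,
    writable: true,
    configurable: true,
    enumerable: false
  });
  return true;
}

const sealedTarget = Object.seal({});
const openTarget = {};
const impl = function () { return String(this); };

console.log(installNormalize(sealedTarget, impl)); // skipped, no exception
console.log(installNormalize(openTarget, impl));   // installed
```

Callers that find the installer returned false can still call the exported unorm functions directly.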
Thanks
Regards
Just tossing some ideas out there.
The technical report outlines the Quick_Check algorithms for fast verification of whether a given string is in a given normalization form: Detecting Normalization Forms.
These are YES / NO / MAYBE answers, so in theory it should be possible to implement this as two regexps:
The regexps should be automatically generated. It is complicated by the fact that JavaScript regexps only support UCS-2 and not UTF-16, so we have to manually calculate surrogate pairs (and nested matching groups). See punycode.js.
If implemented, we could add test functions, e.g. 'foo'.isNormalization('NFC'). Internally they could be used to speed up any given normalization. Something like this: use the whitelist regexp to match the longest prefix that is already in this normalization form, then cut that out and normalize the rest recursively. We only need to normalize the parts that are not already normalized, but we have to be careful about the boundaries between normalized and non-normalized parts. Some more strategies are outlined in the technical report: Optimization Strategies.
First and foremost, we should make a benchmark test suite to find out whether these optimizations actually give a speed boost for long strings. It would also be nice to know how much they add to the size of the library.
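The prefix idea above can be sketched for NFC with a deliberately crude whitelist. This assumes that every character below U+0300 is NFC_Quick_Check=Yes with combining class 0, so a run of such characters is already in NFC; a generated regexp would cover far more. It uses the native String.prototype.normalize for the slow path, and nfcWithFastPath is a hypothetical name:

```javascript
// Assumption: characters below U+0300 never change under NFC and never
// combine with a preceding character.
const SAFE_PREFIX = /^[\u0000-\u02FF]*/;

function nfcWithFastPath(str) {
  const prefixLength = SAFE_PREFIX.exec(str)[0].length;
  if (prefixLength === str.length) return str; // nothing to normalize
  // Back off one character: the last safe character may compose with a
  // combining mark that follows it.
  const cut = Math.max(prefixLength - 1, 0);
  return str.slice(0, cut) + str.slice(cut).normalize('NFC');
}

const input = 'cafe\u0301 au lait';
console.log(nfcWithFastPath(input) === input.normalize('NFC'));
```

Backing off by one character is the boundary handling mentioned above: "cafe" matches the whitelist, but the final "e" still has to be normalized together with the accent that follows it.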