
unorm's People

Contributors

aaronshaf, bendkt, jakechampion, mpcref, openben, phadej, sonnyp, walling


unorm's Issues

Compression on build

The udata variable obviously has size issues, already pointed out in #17, but I was thinking that pre-compressing it in the build phase could significantly cut its size. For example, using the following process to compress that variable's content:

  1. Stringify the JSON.
  2. Compress it with bzip2 (near the top in both speed and efficiency).
  3. Base64-encode it. This will help remove some redundancy.
  4. Include that as a string in the built source.

And then, on load or first use on the target side:

  1. Decode it.
  2. Decompress it. (should be significantly faster than compression)
  3. Parse the JSON.
  4. Ready to use.

NFC versus W3C canonicalization

Hello,

I have been looking into a canonical representation of user-edited UTF-8 strings. I was planning to use NFC but ran into this (http://www.unicode.org/faq/normalization.html):

Q: What is the difference between W3C normalization and Unicode normalization?
A: Unicode normalization comes in 4 flavors: C, D, KC, KD. It is C that is relevant for W3C normalization. W3C normalization also treats character references (&#nnnn;) as equivalent to characters. For example, the text string "a&#xnnnn;" (where nnnn = "0301") is Unicode-normalized, since it consists only of ASCII characters, but it is not W3C-normalized, since it contains a representation of a combining acute accent with "a", and in normalization form C, that should have been normalized to U+00E1. [JC]

Would it be easy for you / would it make sense to add a W3C normalization method directly in unorm?

Am I right to think that this normalization would give better results for browser-edited content (where users may be using WYSIWYG editors that emit the &#xnnnn; notation)?
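
As a rough illustration of what such a method could do, here is a sketch with a hypothetical `w3cNormalize` helper: it expands numeric character references and then applies NFC (using the native `normalize` as a stand-in for unorm's NFC). Real W3C normalization covers more cases than this.

```javascript
// Hypothetical sketch: expand numeric character references, then apply NFC.
function w3cNormalize(text) {
  const expanded = text.replace(/&#(?:x([0-9a-fA-F]+)|(\d+));/gi, (m, hex, dec) =>
    String.fromCodePoint(hex ? parseInt(hex, 16) : parseInt(dec, 10))
  );
  return expanded.normalize('NFC');
}
```

For the FAQ's example, `w3cNormalize('a&#x0301;')` yields U+00E1 ("á"), which plain NFC on the raw ASCII string would not.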

"Dual Licensing"

I have a question regarding the software licensing.

You specify that this software is dual-licensed (MIT and GPL). You explicitly write "and", not "or". Does this mean that users of the software have to fulfill the obligations of both licenses equally, or that they can choose between the MIT and the GPL?

It would be great if you could give me a short answer to the question, as I am currently clarifying whether we can use the software compliantly. This would not be possible at the moment, if we have to comply with the GPL at the same time.

Thanks for your answer and thanks for your work

Make it a true shim

If the JavaScript environment supports the String.prototype.normalize function, this library should use that and export it as the unorm functions (as a fallback for code that expects them).
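
A minimal sketch of that delegation, assuming a hypothetical pure-JS fallback named `jsNfc` (unorm's real internal names may differ):

```javascript
// Hypothetical fallback: the library's pure-JS NFC implementation would go
// here; the identity body is just a placeholder for this sketch.
function jsNfc(str) { /* pure-JS NFC implementation */ return str; }

// Prefer the engine's native normalize when it exists; otherwise fall back.
const hasNative = typeof String.prototype.normalize === 'function';
const unorm = {
  nfc: hasNative ? (str) => str.normalize('NFC') : jsNfc,
};
```

On a modern engine the native path is taken, so code that expects `unorm.nfc` keeps working while getting the fast built-in.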

Please tag 1.0.2

In the process of packaging etherpad-lite for Debian I have to package unorm as it is an etherpad-lite dependency.

For packaging I prefer relying on version tags in upstream Git.

Please tag version 1.0.2 (if appropriate) and also tag new releases with version numbers, so that the Debian QA site will notify the Debian Javascript Packaging team of new upstream releases of unorm.

Please also consider adding yourself to the list of copyright holders (if this meets reality).

Thanks+Greets,
Mike ([email protected])

Something wrong with your example.js

Hi, there is something wrong with your example.js, in which you use XRegExp. To get the XRegExp function you have to use require('xregexp').XRegExp...

Not able to do string normalization by using Unorm.js

In my Visual Studio 2008 ASP.NET project I tried to use unorm.js for string normalization. I added the unorm.js file to my project, used it on my web page, and wrote my own JavaScript function to normalize strings using the UNorm.normalize('NFC', str) function,

but an error occurs within the Unorm.js file's function fromData(next, cp, needfeature) as given below:

JavaScript runtime error: Unable to get property '768' of undefined or null reference

I tried following ways to normalize the string from my javascript code but no luck:

var strpwd = 'Ω';

1- nstr = UNorm.normalize('NFC',strpwd);

2- nstr = UNorm.nfc(strpwd);

Please suggest a solution.
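
One thing worth checking in a report like this: the 'Ω' in the snippet can be either GREEK CAPITAL LETTER OMEGA (U+03A9) or OHM SIGN (U+2126), which look identical. Using the native API as a reference for the result a correctly loaded unorm should produce:

```javascript
// NFC maps OHM SIGN (U+2126) to GREEK CAPITAL LETTER OMEGA (U+03A9);
// U+03A9 itself is already in NFC and passes through unchanged.
const ohm = '\u2126';
const nfc = ohm.normalize('NFC'); // '\u03A9'
```

The "Unable to get property '768' of undefined" error suggests the udata table inside unorm.js was not loaded intact (for example, a truncated or re-encoded copy of the file), rather than a problem with the input string itself.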

Look at library size

We could potentially compress udata better. I've been researching this a bit, and we could shave off a good number of bytes by changing the data layout and saving it in base-36 (which is fast for JavaScript to decode with parseInt).

I also think it's an issue that the code points are laid out in this binary format: yyyyyxxxxxxxxyyyyyyyy. This makes the x=0 section quite big, but much of the time you'd only use Latin-1 characters and no characters outside the BMP. A better format would be xxxxxxxxxxxxxyyyyyyyy. This creates more data rows, but you have to decompress less data on average, based on the assumption that normal text only revolves around a few Unicode scripts. Or maybe we should make a split between the way BMP and outside-BMP code points are stored.

I just need to look at my research files again and write the points of my research down in this issue.
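
The base-36 idea can be sketched in a few lines (the row format here is illustrative, not unorm's actual layout):

```javascript
// Code points serialize compactly with Number#toString(36) and decode
// quickly with parseInt(str, 36).
function encodeRow(codepoints) {
  return codepoints.map((cp) => cp.toString(36)).join(',');
}

function decodeRow(str) {
  return str.split(',').map((s) => parseInt(s, 36));
}
```

Base-36 digits stay within [0-9a-z], so the encoded rows embed directly in a JS string literal without escaping.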

unorm and streaming

Hello,

Have you considered bringing a streaming API to unorm? Do you think it is technically feasible (is normalization only a 'local' operation, where you only need to know a few characters before taking a decision), or do you strictly need the whole buffered string before starting to normalize?
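
Normalization is mostly local: a chunk can be safely flushed up to the last "starter" character. A rough sketch of the idea, holding back a trailing base character plus combining marks between chunks (a simplification; a real implementation would use the Quick_Check data to find safe boundaries, and `makeNormalizer` is a hypothetical name):

```javascript
// Chunked NFC normalization: flush the safe prefix of each chunk and carry
// the unsafe tail (last base char + trailing combining marks) forward.
function makeNormalizer() {
  let tail = '';
  return {
    push(chunk) {
      const buf = tail + chunk;
      // Find a final base character followed only by combining diacritics.
      const m = buf.match(/.[\u0300-\u036F]*$/u);
      const cut = m ? m.index : buf.length;
      tail = buf.slice(cut);
      return buf.slice(0, cut).normalize('NFC');
    },
    flush() {
      const out = tail.normalize('NFC');
      tail = '';
      return out;
    },
  };
}
```

The key point is that a combining mark arriving in the next chunk can still compose with the held-back base character, so the boundary never splits a composition.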

Do not apply polyfill by default

Hi,

In some applications String.prototype is sealed for security reasons.
In that context, defining a new property on String.prototype will throw.
It would be great if the polyfill were applied only when requested via an explicit argument, or if you could provide a way to use unorm with a sealed String.prototype.

Thanks
Regards
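
A guarded install could avoid the throw entirely. This is a sketch with a hypothetical `installPolyfill` entry point (unorm's real API may differ): it patches only when `normalize` is missing and the prototype is still extensible.

```javascript
// Install the polyfill only when it is both needed and possible; return
// whether the prototype was actually patched.
function installPolyfill(nfcImpl) {
  const proto = String.prototype;
  if (typeof proto.normalize === 'function') return false; // native exists
  if (!Object.isExtensible(proto)) return false;           // sealed/frozen
  Object.defineProperty(proto, 'normalize', {
    value: function (form) { return nfcImpl(String(this), form); },
    configurable: true,
    writable: true,
  });
  return true;
}
```

Callers who need normalization under a sealed prototype could then use the module-level functions directly instead of the prototype method.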

Ideas for optimizations

Just tossing some ideas out there.

The technical report outlines the Quick_Check algorithms for fast verification whether a given string is in a given normalization form: Detecting Normalization Forms.

These are YES / NO / MAYBE answers, so in theory it should be possible to implement this as two regexps:

  1. Whitelist regexp: If this regexp matches, the answer is YES, otherwise continue.
  2. Blacklist regexp: If this regexp matches, the answer is NO, otherwise MAYBE.

The regexps should be automatically generated. It is complicated by the fact that JavaScript regexps only support UCS-2 and not UTF-16, so we have to manually calculate surrogate pairs (and nested matching groups). See punycode.js.
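
The two-regexp check might look like this. The character classes below are tiny hand-picked samples (assumptions, not the generated classes): U+0340, U+0341, and U+2126 really have NFC_Quick_Check=No, and code points below U+0300 are all NFC_Quick_Check=Yes, but the real classes must be generated from the Unicode data files.

```javascript
// Whitelist: the whole string consists of NFC_Quick_Check=Yes characters.
const NFC_YES = /^[\u0000-\u02FF]*$/;
// Blacklist: the string contains an NFC_Quick_Check=No character.
const NFC_NO = /[\u0340\u0341\u2126]/;

function quickCheckNFC(str) {
  if (NFC_YES.test(str)) return 'YES';
  if (NFC_NO.test(str)) return 'NO';
  return 'MAYBE';
}
```

Only the MAYBE case would fall through to the full normalization-and-compare path.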

If implemented, we could add test functions, e.g. 'foo'.isNormalization('NFC'). Internally they could be used to speed up any given normalization. Something like this: use the whitelist regexp to match the longest prefix already in this normalization, then cut that out and normalize the rest recursively. We only need to normalize the parts that are not in the normalization already, but we have to be careful about the boundaries between normalized and non-normalized parts. Some more strategies are outlined in the technical report: Optimization Strategies.

First and foremost, we should make a benchmark test suite, to actually learn whether these optimizations give a speed boost for long strings. And it would be nice to know how much they add to the size of the library.
