grigorig / ucdn
Unicode Database and Normalization
License: Other
UCDN - Unicode Database and Normalization

UCDN is a Unicode support library. Currently, it provides access to basic character properties contained in the Unicode Character Database and low-level normalization functions (pairwise canonical composition/decomposition and compatibility decomposition). More functionality might be provided in the future, such as additional properties, string normalization and encoding conversion.

UCDN uses standard C89 with no particular dependencies or requirements except for stdint.h, and can be easily integrated into existing projects. However, it can also be used as a standalone library, and a CMake build script is provided for this.

The first motivation behind UCDN development was to provide a standalone set of Unicode functions for the HarfBuzz OpenType shaping library. For this purpose, a HarfBuzz-specific wrapper is shipped along with it (hb-ucdn.h).

UCDN is published under the ISC license; please see the license header in the C source code for more information. The makeunicodedata.py script required for parsing Unicode database files is licensed under the PSF license; please see PYTHON-LICENSE for more information.

UCDN was written by Grigori Goronzy <[email protected]>.

How to Use

Include ucdn.c, ucdn.h and ucdn_db.h in your project. Now, just use the functions as documented in ucdn.h.

In some cases, it might be necessary to regenerate the Unicode database file. The script makeunicodedata.py (Python 3.x required) fetches the appropriate files and dumps the compressed database into ucdn_db.h.
Patch and test:
diff --git a/src/hb-ucdn/ucdn.c b/src/hb-ucdn/ucdn.c
index 30747fea..f7b33d64 100644
--- a/src/hb-ucdn/ucdn.c
+++ b/src/hb-ucdn/ucdn.c
@@ -163,7 +163,8 @@ static int hangul_pair_decompose(uint32_t code, uint32_t *a, uint32_t *b)
static int hangul_pair_compose(uint32_t *code, uint32_t a, uint32_t b)
{
- if (a >= SBASE && a < (SBASE + SCOUNT) && b >= TBASE && b < (TBASE + TCOUNT)) {
+ if (a >= SBASE && a < (SBASE + SCOUNT) && b > TBASE && b < (TBASE + TCOUNT) &&
+ !((a - SBASE) % TCOUNT)) {
/* LV,T */
*code = a + (b - TBASE);
return 3;
diff --git a/test/api/test-unicode.c b/test/api/test-unicode.c
index 6195bb28..0587c6e7 100644
--- a/test/api/test-unicode.c
+++ b/test/api/test-unicode.c
@@ -755,6 +755,10 @@ test_unicode_normalization (gconstpointer user_data)
g_assert (hb_unicode_compose (uf, 0xCE20, 0x11B8, &ab) && ab == 0xCE31);
g_assert (hb_unicode_compose (uf, 0x110E, 0x1173, &ab) && ab == 0xCE20);
+ g_assert (!hb_unicode_compose (uf, 0xAC00, 0x11A7, &ab));
+ g_assert (hb_unicode_compose (uf, 0xAC00, 0x11A8, &ab) && ab == 0xAC01);
+ g_assert (!hb_unicode_compose (uf, 0xAC01, 0x11A8, &ab));
+
/* Test decompose() */
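The patch above relies on Hangul syllable arithmetic. An LV syllable plus a trailing consonant composes only if the first code point really is an LV syllable (its offset from SBASE is a multiple of TCOUNT) and the second is a real trailing consonant (strictly above TBASE, which sits one below the first trailing consonant U+11A8). The following is a minimal standalone sketch of that rule, not UCDN's actual code: the name hangul_compose_sketch and its 0/1 return convention are assumptions made for illustration, while the constants are those of the Hangul composition algorithm in the Unicode Standard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Constants of the Hangul syllable composition algorithm. */
#define SBASE  0xAC00u
#define LBASE  0x1100u
#define VBASE  0x1161u
#define TBASE  0x11A7u
#define LCOUNT 19u
#define VCOUNT 21u
#define TCOUNT 28u
#define NCOUNT (VCOUNT * TCOUNT)
#define SCOUNT (LCOUNT * NCOUNT)

/* Hypothetical helper (not part of ucdn.h).
 * Returns 1 and stores the composed syllable in *code, or returns 0
 * if the pair (a, b) does not compose. */
static int hangul_compose_sketch(uint32_t *code, uint32_t a, uint32_t b)
{
    assert(code != NULL);

    /* L + V -> LV */
    if (a >= LBASE && a < LBASE + LCOUNT &&
        b >= VBASE && b < VBASE + VCOUNT) {
        *code = SBASE + ((a - LBASE) * VCOUNT + (b - VBASE)) * TCOUNT;
        return 1;
    }

    /* LV + T -> LVT. Two conditions that the original check missed:
     * a must be an LV syllable (no trailing consonant yet), and b must
     * be strictly greater than TBASE. */
    if (a >= SBASE && a < SBASE + SCOUNT && (a - SBASE) % TCOUNT == 0 &&
        b > TBASE && b < TBASE + TCOUNT) {
        *code = a + (b - TBASE);
        return 1;
    }

    return 0;
}
```

With this sketch, U+AC00 followed by U+11A8 yields U+AC01, while U+AC00 followed by U+11A7 and U+AC01 followed by U+11A8 are both rejected, mirroring the assertions added to test-unicode.c.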
I am writing Lua bindings for ucdn at https://github.com/deepakjois/luaucdn, with the ultimate goal of implementing a pure Lua version of the Unicode Bidirectional Algorithm.
One part of the algorithm uses the properties Bidi_Paired_Bracket and Bidi_Paired_Bracket_Type defined in the BidiBrackets.txt file. Both of these are derived properties, so I suppose it is possible to obtain them from the data and API methods that ucdn currently provides.
However, there is also this warning in UAX #44:
Implementations should simply use the derived properties, and should not try to rederive them from lists of simple properties and collections of rules, because of the chances for error and divergence when doing so.
In light of that, what is your opinion on providing Bidi_Paired_Bracket and Bidi_Paired_Bracket_Type directly in ucdn?
I imported UCDN into HarfBuzz. Running the test suite shows that UCDN returns 1 for East Asian width of U+30000, while the test suite expects 2.
Hi, I think there's a problem with the makeunicodedata.py script. The general_category column of code points inside a range (such as U+5000 in the U+4E00..9FCC range) is incorrectly set to UCDN_GENERAL_CATEGORY_CN, which means the code point is unassigned. However, it should be the same as that of the range's 'First>' and 'Last>' code points, which carry the meaningful value.
Looking forward to someone getting this fixed. Thanks a lot.
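The expected behavior can be sketched as follows. In UnicodeData.txt, large blocks are listed only as a pair of records whose names end in ", First>" and ", Last>", and every code point between them shares the fields of those two records. This is a minimal sketch in C rather than the Python of makeunicodedata.py. The helper names (parse_record, has_suffix, fill_general_categories) and the flat table of two-letter category strings are assumptions for illustration, not the script's actual structure.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Parse the first three fields of a "CODE;NAME;GC;..." record.
 * Returns 1 on success, 0 if the line is malformed. */
static int parse_record(const char *line, uint32_t *cp,
                        const char **name, size_t *name_len, char gc[3])
{
    char *end;
    const char *name_end;
    unsigned long value = strtoul(line, &end, 16);

    if (end == line || *end != ';' || value > 0x10FFFFul)
        return 0;
    *name = end + 1;
    name_end = strchr(*name, ';');
    if (name_end == NULL)
        return 0;
    *name_len = (size_t)(name_end - *name);
    /* The general category is two letters followed by ';'. */
    if (name_end[1] == '\0' || name_end[2] == '\0' || name_end[3] != ';')
        return 0;
    *cp = (uint32_t)value;
    gc[0] = name_end[1];
    gc[1] = name_end[2];
    gc[2] = '\0';
    return 1;
}

static int has_suffix(const char *s, size_t len, const char *suffix)
{
    size_t n = strlen(suffix);
    return len >= n && memcmp(s + len - n, suffix, n) == 0;
}

/* Fill table[cp] with the category of every listed code point, expanding
 * First/Last pairs so that the whole range gets the category.
 * The caller pre-fills the table with "Cn". Returns the number of
 * entries written. */
static size_t fill_general_categories(const char *const *lines, size_t n_lines,
                                      char (*table)[3], uint32_t table_size)
{
    size_t assigned = 0, i;
    uint32_t range_start = 0;
    int in_range = 0;

    assert(lines != NULL && table != NULL);
    for (i = 0; i < n_lines; i++) {
        uint32_t cp, first, c;
        const char *name;
        size_t name_len;
        char gc[3];

        if (!parse_record(lines[i], &cp, &name, &name_len, gc))
            continue;
        if (has_suffix(name, name_len, ", First>")) {
            range_start = cp;   /* remember where the range begins */
            in_range = 1;
            continue;
        }
        first = cp;
        if (in_range && has_suffix(name, name_len, ", Last>"))
            first = range_start; /* cover the whole range, not just Last */
        in_range = 0;
        for (c = first; c <= cp && c < table_size; c++) {
            memcpy(table[c], gc, 3);
            assigned++;
        }
    }
    return assigned;
}
```

Fed the First/Last records for U+4E00..9FCC, the table entry for U+5000 ends up as "Lo" instead of staying at the "Cn" default.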
UCDN doesn't have any automated testing. We need at least some equivalence class testing, or ideally verification of correctness against the whole character database.
Unicode 10.0 added a new annex, Unicode Vertical Text Layout, which introduces the Vertical_Orientation property. The property defines how characters are oriented in vertical text, which is useful for implementations that deal with vertical typography.
We updated HarfBuzz already. Would be nice to get a snapshot with Unicode 9 beta out.
Just out of the oven...
I'm updating HarfBuzz to make a release later today. If you can update soon, would be nice.
I've deprecated them in HarfBuzz and am unwiring the implementations. Please remove them and I'll update the HarfBuzz copy. Thanks.
Hi,
I'd like to thank you for UCDN again. It served HarfBuzz very well for many years. Recently I was trying to squeeze bytes out of HarfBuzz, and replacing UCDN became a fruitful target. Please see:
The generator can be used for other arrays as well, in case you want to use it in other places or regenerate UCDN based on it.
Cheers,
b
Please add the typical "ifdef __cplusplus extern "C"" stuff such that including the header from C++ works as well. See hb/src/hb-ucdn/ucdn.h for example.
Currently they are not const, which means they will end up in the .data section of the library. Not good. Just add const.
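The two requests above (C++ linkage guards in the header and const data tables) can be sketched together. This is a minimal illustration under stated assumptions, not UCDN's code: example_mirror, ExampleMirrorPair and example_mirror_pairs are hypothetical names, and the table holds only the three ASCII bracket pairs. Toolchains typically place const objects with static storage in a read-only section such as .rodata instead of .data, though the exact placement depends on the compiler and linker.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* --- Would live in the public header --- */
#ifdef __cplusplus
extern "C" {   /* give the declarations C linkage when included from C++ */
#endif

/* Hypothetical API: stores the mirrored form of code in *mirrored.
 * Returns 1 if a mirror exists, 0 otherwise (then *mirrored == code). */
int example_mirror(uint32_t code, uint32_t *mirrored);

#ifdef __cplusplus
}
#endif

/* --- Would live in the .c file --- */
typedef struct {
    uint16_t from;
    uint16_t to;
} ExampleMirrorPair;

/* const: the table is never written, so it need not occupy writable data. */
static const ExampleMirrorPair example_mirror_pairs[] = {
    {0x0028, 0x0029}, {0x0029, 0x0028},   /* ( ) */
    {0x005B, 0x005D}, {0x005D, 0x005B},   /* [ ] */
    {0x007B, 0x007D}, {0x007D, 0x007B}    /* { } */
};

int example_mirror(uint32_t code, uint32_t *mirrored)
{
    size_t i;
    const size_t n = sizeof(example_mirror_pairs) / sizeof(example_mirror_pairs[0]);

    assert(mirrored != NULL);
    for (i = 0; i < n; i++) {
        if (example_mirror_pairs[i].from == code) {
            *mirrored = example_mirror_pairs[i].to;
            return 1;
        }
    }
    *mirrored = code;
    return 0;
}
```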
Please revert 0613261. That is not true, and it breaks decomposition for U+212B ANGSTROM SIGN and possibly a few other characters. We've reverted that in HarfBuzz. Caught by HB's test-unicode.c