Unicode normalization functions. Mirror of https://gerrit.wikimedia.org/g/utfnormal/. See https://www.mediawiki.org/wiki/Developer_access for contributing.

Home Page: https://www.mediawiki.org/wiki/Utfnormal

License: GNU General Public License v2.0

utfnormal

utfnormal is a library that contains Unicode normalization routines, including both pure PHP implementations and automatic use of the 'intl' PHP extension when present.

The main function to care about is UtfNormal\Validator::cleanUp(). This will strip illegal UTF-8 sequences and characters that are illegal in XML, and if necessary convert to normalization form C.

If you know the string is already valid UTF-8, you can call UtfNormal\Validator::toNFC(), toNFD(), toNFKC(), or toNFKD() directly; each converts a given UTF-8 string to the corresponding normalization form (C, D, KC, or KD) if it is not already in that form. These functions assume that the input is valid UTF-8; corrupt characters may produce erroneous results.
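As a rough illustration of the two entry points described above, the sketch below wraps them in two small helpers. It assumes the library was installed with Composer as wikimedia/utfnormal and that the autoloader lives at ./vendor/autoload.php; the helper names cleanUntrusted() and normalizeTrusted() are hypothetical and not part of the library. If the library is not installed, both helpers return null rather than failing.

```php
<?php
// Assumption: Composer autoloader next to this script; adjust the path as needed.
$autoload = __DIR__ . '/vendor/autoload.php';
if ( is_file( $autoload ) ) {
	require_once $autoload;
}

/**
 * Hypothetical wrapper for input of unknown origin: cleanUp() validates
 * the UTF-8 first and then normalizes to form C if necessary.
 * Returns null when the library is not available.
 */
function cleanUntrusted( string $input ): ?string {
	if ( !class_exists( \UtfNormal\Validator::class ) ) {
		return null;
	}
	return \UtfNormal\Validator::cleanUp( $input );
}

/**
 * Hypothetical wrapper for input already known to be valid UTF-8:
 * toNFC() skips validation and only normalizes.
 * Returns null when the library is not available.
 */
function normalizeTrusted( string $validUtf8 ): ?string {
	if ( !class_exists( \UtfNormal\Validator::class ) ) {
		return null;
	}
	return \UtfNormal\Validator::toNFC( $validUtf8 );
}

// "Cafe" followed by U+0301 COMBINING ACUTE ACCENT: valid UTF-8, but not in NFC.
$sample = "Cafe\u{0301}";
var_dump( cleanUntrusted( $sample ) );
var_dump( normalizeTrusted( $sample ) );
```

With the library installed, both calls should return the composed form of the sample, ending in U+00E9. Prefer cleanUp() whenever the source of the string is not fully trusted.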

Performance is kind of stinky in absolute terms, though it should be speedy on pure ASCII text. ;) On text that can quickly be determined to be in NFC already, it's not too awful, but it can quickly get uncomfortably slow, particularly for Korean text (the Hangul decomposition/composition code is extra slow).
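To give a sense of the per-syllable work that Hangul composition involves, here is a minimal sketch of the arithmetic the Unicode Standard defines for composing a syllable from its jamo. It is not the library's own code; the constant and function names are hypothetical, and only the numeric values come from the standard.

```php
<?php
// Constants from the Unicode Standard's Hangul syllable composition algorithm.
const HANGUL_S_BASE = 0xAC00; // first precomposed syllable
const HANGUL_L_BASE = 0x1100; // first leading consonant
const HANGUL_V_BASE = 0x1161; // first vowel
const HANGUL_T_BASE = 0x11A7; // one before the first trailing consonant
const HANGUL_L_COUNT = 19;
const HANGUL_V_COUNT = 21;
const HANGUL_T_COUNT = 28;    // includes the "no trailing consonant" slot

/**
 * Compose jamo code points into one precomposed syllable code point.
 * Pass HANGUL_T_BASE (the default) as $t for "no trailing consonant".
 * Returns null if any argument is outside its jamo range.
 */
function composeHangul( int $l, int $v, int $t = HANGUL_T_BASE ): ?int {
	$lIndex = $l - HANGUL_L_BASE;
	$vIndex = $v - HANGUL_V_BASE;
	$tIndex = $t - HANGUL_T_BASE;
	if (
		$lIndex < 0 || $lIndex >= HANGUL_L_COUNT ||
		$vIndex < 0 || $vIndex >= HANGUL_V_COUNT ||
		$tIndex < 0 || $tIndex >= HANGUL_T_COUNT
	) {
		return null;
	}
	return HANGUL_S_BASE
		+ ( $lIndex * HANGUL_V_COUNT + $vIndex ) * HANGUL_T_COUNT
		+ $tIndex;
}

// U+1112, U+1161, U+11AB compose to U+D55C.
printf( "U+%04X\n", composeHangul( 0x1112, 0x1161, 0x11AB ) ); // U+D55C
```

Decomposition is the inverse calculation, so a pure PHP implementation has to do this kind of arithmetic for each syllable in the text.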

Bugs should be filed in Wikimedia's Phabricator under the "utfnormal" project.

Regenerating data tables

UtfNormalData.inc and UtfNormalDataK.inc are generated from the Unicode Character Database by the script "generate.php". Run "composer generate" to rebuild the tables. To fetch updated Unicode data from the internet, run "composer generate -- --fetch".

Testing

Running "composer test" runs a syntax checker, PHPUnit conformance tests, and some benchmarks using sample texts from Wikipedia. Take all benchmark numbers with large grains of salt.

PHP module extension

If the 'intl' PHP extension is present, the library uses ICU library functions, which are MUCH faster than doing this work in pure PHP code.

It is strongly recommended to enable this module if possible: http://php.net/manual/en/intro.intl.php
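A quick way to check whether the extension is available in a given PHP environment is sketched below. The helper name describeNormalizationBackend() is hypothetical, and the check only reports what PHP has loaded; it does not inspect how utfnormal itself selects its code path. The second half calls intl's own Normalizer class directly, for comparison.

```php
<?php
/**
 * Hypothetical helper: report whether intl's Normalizer is available,
 * which indicates whether the fast ICU-backed path can be used.
 */
function describeNormalizationBackend(): string {
	return ( extension_loaded( 'intl' ) && class_exists( 'Normalizer' ) )
		? 'intl (ICU)'
		: 'pure PHP';
}

echo describeNormalizationBackend(), "\n";

// Direct use of intl, guarded so the script still runs without the extension.
$composed = null;
if ( class_exists( 'Normalizer' ) ) {
	$decomposed = "e\u{0301}"; // 'e' + U+0301 COMBINING ACUTE ACCENT
	$composed = Normalizer::normalize( $decomposed, Normalizer::FORM_C );
	// $composed is now the single code point U+00E9
}
```

From the command line, "php -m" lists the loaded extensions, which is another way to confirm that intl is enabled.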

Older versions of this library supported a one-off custom PHP extension, which has been dropped. If you were using this, please migrate to the intl extension.

History

This library was first introduced in MediaWiki 1.3 (r4965). It was split out of the MediaWiki codebase and published as an independent library during the MediaWiki 1.25 development cycle.

