Split words with Unicode's default word boundary specification

License: MIT License

Unicode Default Word Boundary

Implements the Unicode UAX #29 §4.1 default word boundary specification, for finding word breaks in multilingual text.

Use this to split words in text! UAX #29 is a lot smarter than the \b word boundary in JavaScript's regular expressions: \b and character classes like \w and \d only recognize ASCII characters.
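
To see the ASCII limitation concretely, here is a small sketch that uses only built-in regular expressions (no calls into this package); the variable names are illustrative:

```javascript
// \w matches only [A-Za-z0-9_], so Cyrillic text yields no matches at all.
const russian = 'жил бы';
const asciiOnly = russian.match(/\w+/g);

// An accented letter is not an ASCII "word character", so the word is cut short.
const cafe = 'caf\u00e9'; // "café" with a precomposed é
const truncated = cafe.match(/\w+/g);

// Unicode property escapes (with the u flag) do see the letters, but they
// know nothing about the UAX #29 rules for punctuation, numbers, or emoji.
const letters = russian.match(/\p{L}+/gu);

console.log(asciiOnly); // null
console.log(truncated); // [ 'caf' ]
console.log(letters);   // [ 'жил', 'бы' ]
```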

Usage

Import the module and use the split() function:

const split = require('unicode-default-word-boundary').split;

console.log(split(`The quick (“brown”) fox can’t jump 32.3 feet, right?`));

Output:

[ 'The', 'quick', '(', '“', 'brown', '”', ')', 'fox', 'can’t', 'jump', '32.3', 'feet', ',', 'right', '?' ]

But that's not all! Try it with non-English text, like Russian:

split(`В чащах юга жил бы цитрус? Да, но фальшивый экземпляр!`)
[ 'В', 'чащах', 'юга', 'жил', 'бы', 'цитрус', '?', 'Да', ',', 'но', 'фальшивый', 'экземпляр', '!' ]

...Hebrew:

split(`איך בלש תפס גמד רוצח עז קטנה?`);
[ 'איך', 'בלש', 'תפס', 'גמד', 'רוצח', 'עז', 'קטנה', '?' ]

...nêhiyawêwin:

split(`ᑕᐻ ᒥᔪ ᑭᓯᑲᐤ ᐊᓄᐦᐨ᙮`);
[ 'ᑕᐻ', 'ᒥᔪ ᑭᓯᑲᐤ', 'ᐊᓄᐦᐨ', '᙮' ]

...and many more!

More advanced use cases will want to use the findSpans() or the findBoundaries() function.

What doesn't work

Languages that do not have obvious word breaks, such as Chinese, Japanese, Thai, Lao, and Khmer. You'll need to use statistical or dictionary-based approaches to split words in these languages.
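
One option for such languages is the built-in Intl.Segmenter, which is separate from this package and, in ICU-backed runtimes, typically relies on dictionaries for these scripts. The sketch below assumes a runtime that provides Intl.Segmenter with full ICU data (for example, a recent Node.js); the exact segments depend on the runtime's data, so no output is shown:

```javascript
// Intl.Segmenter is a built-in ECMAScript API, not part of this package.
const segmenter = new Intl.Segmenter('ja', { granularity: 'word' });
const japanese = 'こんにちは世界';

// Each segment reports its text and whether it looks like a word.
const wordLike = Array.from(segmenter.segment(japanese))
  .filter((part) => part.isWordLike)
  .map((part) => part.segment);

console.log(wordLike);
```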

API Documentation

The following functions make up the primary API:

split(text: string): string[]

split() splits the text at word boundaries, returning an array of all "words" from the text that contain characters other than whitespace.

See above for examples.

findSpans(text: string): Iterable<BasicSpan>

findSpans() is a generator that yields successive basic spans from the text. A basic span is a chunk of text that is guaranteed to start at a word boundary and end at the next word boundary. In other words, basic spans are indivisible in that there are no word boundaries contained within a basic span.

A basic span has the following properties:

interface BasicSpan {
    /** Where the span starts, relative to the input text. */
    start: number;
    /** At what index does the **next** span begin. */
    end: number;
    /** How many UTF-16 code units are in this span. */
    length: number;
    /** The text contained within this span. */
    text: string;
}

Note that, unlike split(), findSpans() does yield spans that contain whitespace.

Example

Array.from(findSpans("Hello, world🌎!"))

Will yield spans with the following properties:

[ { start: 0, end: 5, length: 5, text: 'Hello' },
  { start: 5, end: 6, length: 1, text: ',' },
  { start: 6, end: 7, length: 1, text: ' ' },
  { start: 7, end: 12, length: 5, text: 'world' },
  { start: 12, end: 14, length: 2, text: '🌎' },
  { start: 14, end: 15, length: 1, text: '!' } ]

N.B.: findSpans() may not yield plain JavaScript objects like those shown above. The objects it yields adhere to the BasicSpan interface, but their concrete type may differ.
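
As a rough illustration of how findSpans() relates to split(), the sketch below rebuilds split()-style output from span-like objects by dropping whitespace-only spans. The spans array is copied from the example output above rather than produced by the library, and wordsFromSpans is a hypothetical helper, not part of the package's API or necessarily how split() is implemented:

```javascript
// Span data copied from the "Hello, world🌎!" example; plain objects stand in
// for whatever findSpans() actually yields.
const spans = [
  { start: 0, end: 5, length: 5, text: 'Hello' },
  { start: 5, end: 6, length: 1, text: ',' },
  { start: 6, end: 7, length: 1, text: ' ' },
  { start: 7, end: 12, length: 5, text: 'world' },
  { start: 12, end: 14, length: 2, text: '🌎' },
  { start: 14, end: 15, length: 1, text: '!' },
];

// Hypothetical helper: keep spans with at least one non-whitespace character.
function wordsFromSpans(spanIterable) {
  const words = [];
  for (const span of spanIterable) {
    if (/\S/.test(span.text)) words.push(span.text);
  }
  return words;
}

console.log(wordsFromSpans(spans)); // [ 'Hello', ',', 'world', '🌎', '!' ]
```

The 🌎 span has a length of 2 because start, end, and length count UTF-16 code units, and this emoji occupies a surrogate pair.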

findBoundaries(text: string): Generator<number, void, void>

findBoundaries() is like findSpans() except it yields the index of each successive word boundary. Anecdotally, using this function directly may be faster than generating span objects with findSpans().
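
A minimal sketch of slicing a string directly from boundary indices follows. The assumedBoundaries array is derived from the start and end values in the findSpans() example and is not real library output; whether findBoundaries() itself yields the start of the text (0) and the end of the text is an assumption here. sliceAtBoundaries is a hypothetical helper:

```javascript
// Hypothetical helper: yield the substring between each pair of consecutive
// boundary indices, skipping empty slices.
function* sliceAtBoundaries(text, boundaries) {
  let previous = null;
  for (const boundary of boundaries) {
    if (previous !== null && boundary > previous) {
      yield text.slice(previous, boundary);
    }
    previous = boundary;
  }
}

const text = 'Hello, world🌎!';
// Assumption: boundaries include both 0 and text.length.
const assumedBoundaries = [0, 5, 6, 7, 12, 14, 15];

const pieces = Array.from(sliceAtBoundaries(text, assumedBoundaries));
console.log(pieces); // [ 'Hello', ',', ' ', 'world', '🌎', '!' ]
```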

Contributing and Maintaining

When maintaining this package, you might notice something strange. index.ts depends on ./src/gen/WordBreakProperty.ts, but this file does not exist! It is a generated file, created by reading Unicode property data files, downloaded from Unicode's website. These data files have been compressed and committed to this repository in libexec/:

libexec/
├── WordBreakProperty-15.1.0.txt.gz
├── compile-word-break.js
└── emoji-data-15.1.0.txt.gz

Note that compile-word-break.js actually creates ./src/gen/WordBreakProperty.ts!
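
To give a feel for what reading these data files involves, here is a sketch that parses a few lines in the Unicode Character Database property-file format (code point or range, a semicolon, the property value, then a # comment). It is not the code in compile-word-break.js; parsePropertyLines and the inline sample are hypothetical, and the real script would first decompress the .gz files:

```javascript
// Sample lines in the format used by WordBreakProperty.txt.
const sample = [
  '# WordBreakProperty sample (format only)',
  '000D          ; CR # Cc       <control-000D>',
  '0041..005A    ; ALetter # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z',
  '',
].join('\n');

// Hypothetical parser: returns inclusive code point ranges with their property.
function parsePropertyLines(contents) {
  const ranges = [];
  for (const rawLine of contents.split('\n')) {
    const line = rawLine.split('#')[0].trim(); // drop comments
    if (line === '') continue;
    const [codePoints, property] = line.split(';').map((s) => s.trim());
    // A single code point has no "..", so the range end defaults to its start.
    const [first, last = first] = codePoints.split('..');
    ranges.push({ start: parseInt(first, 16), end: parseInt(last, 16), property });
  }
  return ranges;
}

const ranges = parsePropertyLines(sample);
console.log(ranges);
// [ { start: 13, end: 13, property: 'CR' },
//   { start: 65, end: 90, property: 'ALetter' } ]
```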

How to generate ./src/gen/WordBreakProperty.ts

When you have just cloned the repository, this file will be generated when you run npm install:

npm install

If you want to regenerate it afterwards, you can run the build script:

npm run build

Benchmarking

To run the benchmarks, you can run the following:

npm run benchmarks

To compare the current implementation with a new one, I create a new working tree called opt/:

git worktree add -b «NEW-BRANCH-NAME» opt

Then I make changes in the working tree inside opt/, compile and run the tests, and finally run the benchmarks from the main working tree:

cd opt/
npm install
vim       # do whatever you need to do here
npm test  # this also compiles the TypeScript
cd ..
npm run benchmarks

License

TypeScript implementation © 2019 National Research Council Canada, © 2024 Eddie Antonio Santos. MIT Licensed.

The algorithm comes from UAX #29: Unicode Text Segmentation, an integral part of the Unicode Standard, version 15.1.

unicode-default-word-boundary's Issues

Move internal utility functions out of findBoundaries

Thanks for the latest performance improvements, that made a huge difference.

Doing a little profiling it seems that an extra 8-10% performance can be gained by just moving the:

  • positionAfter()
  • wordbreakPropertyAt()
  • isExtendedPictographicAt()
  • isAHLetter()
  • isMidNumLetQ()

... out of the findBoundaries() function so the utility functions don't have to be re-instantiated for every call to findBoundaries(). This is of course only an issue when findBoundaries() is called many times.

The first three utility functions would then need to have text passed in as an argument.
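
The general shape of that refactoring can be sketched as follows. The function names and bodies here are hypothetical stand-ins, not the library's actual helpers; the point is only that the hoisted helper takes text as a parameter instead of closing over it:

```javascript
// Nested version: the helper closure is re-created on every call.
function countCodePointsNested(text) {
  function nextPosition(pos) {
    const codePoint = text.codePointAt(pos);
    return pos + (codePoint > 0xffff ? 2 : 1); // skip both halves of a surrogate pair
  }
  let count = 0;
  for (let pos = 0; pos < text.length; pos = nextPosition(pos)) count++;
  return count;
}

// Hoisted version: the helper is created once and receives text explicitly.
function nextPositionHoisted(text, pos) {
  const codePoint = text.codePointAt(pos);
  return pos + (codePoint > 0xffff ? 2 : 1);
}

function countCodePointsHoisted(text) {
  let count = 0;
  for (let pos = 0; pos < text.length; pos = nextPositionHoisted(text, pos)) count++;
  return count;
}

const sampleText = 'Hello, world🌎!';
console.log(countCodePointsNested(sampleText));  // 14
console.log(countCodePointsHoisted(sampleText)); // 14
```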

side note:
Moving rules like WB5 up the stack to just after WB2 also gives a small gain, since most texts contain more AH-letters than emoji. This is only 3-4% though (for WB5), so if it disturbs the order and readability of the code too much, there is no need to consider it.

Expose findBoundaries

Thank you for a great and performant module.

Could you expose findBoundaries() as part of the public API?

We are using the module for a full-text search engine, and it seems that we could gain some extra indexing performance if we could skip the LazySpan object and just slice the string directly.
