Full Unicode 15.1 Line Breaking Algorithm, conformant with UAX #14.

Home Page: https://cto-af.github.io/linebreak/

License: MIT License

JavaScript 100.00%
conformant line-break line-breaking uax14 unicode word-wrap

linebreak's Introduction

@cto.af/linebreak

An implementation of the Unicode Line Breaking Algorithm (UAX #14). This implementation started as a refresh of the linebreak package, and still shares a small amount of test driver code with that project. The rest has been rewritten to support a fully rules-based approach that implements UAX #14 from Unicode version 15.0. From that document:

Line breaking, also known as word wrapping, is the process of breaking a section of text into lines such that it will fit in the available width of a page, window or other display area. The Unicode Line Breaking Algorithm performs part of this process. Given an input text, it produces a set of positions called "break opportunities" that are appropriate points to begin a new line. The selection of actual line break positions from the set of break opportunities is not covered by the Unicode Line Breaking Algorithm, but is in the domain of higher level software with knowledge of the available width and the display size of the text.
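
To make the split of responsibilities concrete, here is a minimal sketch of the "higher level software" step: a greedy wrapper that picks actual line breaks from a list of break opportunities. The greedyWrap helper is hypothetical and not part of this package. It assumes each segment is an object with a string and a required field, and it measures width in UTF-16 code units, which is only a rough stand-in for real display width.

```javascript
// Hypothetical helper, not part of @cto.af/linebreak.
// `segments` is any iterable of {string, required} objects, where `string`
// is the text up to a break opportunity and `required` marks a mandatory break.
function greedyWrap(segments, width) {
  const lines = [];
  let line = '';
  for (const {string, required} of segments) {
    // Assumption: trailing whitespace may hang past the margin, so it is
    // trimmed before measuring.
    const fits = (line + string).trimEnd().length <= width;
    if (line !== '' && !fits) {
      // The segment does not fit: break at the previous opportunity.
      lines.push(line.trimEnd());
      line = '';
    }
    line += string;
    if (required) {
      // Mandatory break: always end the line here.
      lines.push(line.trimEnd());
      line = '';
    }
  }
  if (line !== '') {
    lines.push(line.trimEnd());
  }
  return lines;
}

const segments = [
  {string: 'my ', required: false},
  {string: 'input ', required: false},
  {string: 'string', required: true},
];
console.log(greedyWrap(segments, 8)); // [ 'my input', 'string' ]
```

A segment that is wider than the available width is still placed on a line of its own. Deciding what to do in that case (hyphenate, overflow, break mid-word) is again up to the caller.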

Installation

npm install @cto.af/linebreak

API

Create and use a new Rules object:

import {Rules} from '@cto.af/linebreak'
const r = new Rules({string: true});
for (const brk of r.breaks('my input string')) {
  console.log(brk.string); // "my ", "input ", "string"
  console.log(brk.pos); // 3, 9, 15
  console.log(brk.required); // false, false, true
}

The string option in the constructor will chop the input up for you into strings, rather than your having to do the slicing yourself. You may only need the positions of the breaks, which is why this isn't done by default. The iterated Break objects also have a required field.
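
If you only ask for positions, the slicing can be done in a few lines. The sketch below uses a hypothetical sliceAtBreaks helper that accepts any iterable of objects with a pos field. It assumes pos is an offset that can be passed directly to String.prototype.slice, which is consistent with the 3, 9, 15 values in the example above but is not otherwise confirmed here.

```javascript
// Hypothetical helper, not part of @cto.af/linebreak.
// `breaks` is any iterable of objects with a numeric `pos` field giving the
// offset just past each segment.
function* sliceAtBreaks(text, breaks) {
  let start = 0;
  for (const {pos} of breaks) {
    yield text.slice(start, pos);
    start = pos;
  }
}

// Stand-in for the positions shown in the example above (3, 9, 15).
const positions = [{pos: 3}, {pos: 9}, {pos: 15}];
console.log([...sliceAtBreaks('my input string', positions)]);
// [ 'my ', 'input ', 'string' ]
```

With the package installed, passing the result of new Rules().breaks(text) as the second argument should produce the same segments that the string option gives you.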

You can tailor the rules that will be applied:

import {Rules, PASS} from '@cto.af/linebreak'
const r = new Rules();
r.replaceRule('LB25', (state) => PASS); // Do something more interesting than this!

There are a few other convenience functions available for modifying rules. A few of the rules interact with one another due to idiosyncrasies of the specification text. Comments have been left at these points in the source. If you are going to replace or remove an existing rule, please make sure to account for those interactions.

To make the conformance tests pass, enable the expanded number definition from UAX #14, Example 7:

const r = new Rules({example7: true});

API Documentation

Full API documentation is available.

Conformance to UAX #14

This package intends to be fully conformant with UAX #14. It currently passes all of the tests published by Unicode when the example7 option is enabled in the constructor.

Other tailoring is possible by adding and removing rules.

License

MIT


linebreak's Issues

Fix one Unicode 15.1 test

The test on line 10287 of LineBreakTest.txt doesn't make sense to me:

× 1B18 ÷ 1B27 × 1B44 × 200C × 1B2B × 1B38 ÷ 1B31 × 1B44 × 1B1D × 1B36 ÷ # × [0.3] BALINESE LETTER CA (AK) ÷ [999.0] BALINESE LETTER PA (AK) × [28.12] BALINESE ADEG ADEG (VI) × [9.0] ZERO WIDTH NON-JOINER (CM1_CM) × [28.13] BALINESE LETTER MA (AK) × [9.0] BALINESE VOWEL SIGN SUKU (CM1_CM) ÷ [999.0] BALINESE LETTER SA SAPA (AK) × [28.12] BALINESE ADEG ADEG (VI) × [28.13] BALINESE LETTER TA LATIK (AK) × [9.0] BALINESE VOWEL SIGN ULU (CM1_CM) ÷ [0.3]

In particular, AK × VI × ZWNJ × AK is supposed to match 28.13, which is:

28.13) ($AK | $DottedCircle | $AS) $VI × ($AK | $DottedCircle)

U+200C (ZERO WIDTH NON-JOINER, class CM) gets turned into VI by rule LB9. Therefore, when rule 28.13 runs, it sees AK VI VI AK, which doesn't match.

Note that LB9 says:

Do not break a combining character sequence; treat it as if it has the line breaking class of the base character in all of the following rules. Treat ZWJ as if it were CM.

LB9 does NOT say "treat it as if it is coalesced into the base character".
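
The difference between the two readings can be shown with a toy model. This is an illustration only, not the package's rule engine: line break classes are plain strings, the function names are made up for this sketch, and LB9's exclusions for certain base classes (such as SP or BK) are ignored.

```javascript
// Toy model of the two readings of LB9. Not the package's implementation.

// Reading A: each CM/ZWJ takes on the class of the element before it,
// but stays a separate element in the sequence.
function inheritBaseClass(classes) {
  const out = [];
  for (const cls of classes) {
    const isCombining = cls === 'CM' || cls === 'ZWJ';
    out.push(isCombining && out.length > 0 ? out[out.length - 1] : cls);
  }
  return out;
}

// Reading B: each CM/ZWJ is absorbed into its base and disappears.
function coalesceIntoBase(classes) {
  return classes.filter((cls, i) => i === 0 || (cls !== 'CM' && cls !== 'ZWJ'));
}

// Rule 28.13 as quoted above: (AK | DottedCircle | AS) VI × (AK | DottedCircle).
// Returns true when three consecutive classes match the pattern.
function matches2813(classes) {
  const before = new Set(['AK', 'DottedCircle', 'AS']);
  const after = new Set(['AK', 'DottedCircle']);
  for (let i = 0; i + 2 < classes.length; i++) {
    if (before.has(classes[i]) && classes[i + 1] === 'VI' && after.has(classes[i + 2])) {
      return true;
    }
  }
  return false;
}

// Classes for 1B27 1B44 200C 1B2B from the test line.
const input = ['AK', 'VI', 'CM', 'AK'];
console.log(inheritBaseClass(input), matches2813(inheritBaseClass(input)));
// [ 'AK', 'VI', 'VI', 'AK' ] false
console.log(coalesceIntoBase(input), matches2813(coalesceIntoBase(input)));
// [ 'AK', 'VI', 'AK' ] true
```

Under reading A the pattern fails, which is the behavior described in this issue. The test's expected output only follows from reading B.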
