unicode's Introduction

Convenient functions for working with unicode.

โš ๏ธ This package has only gone through limited testing. Make an issue when you hit a bug.

👀 examples

📖 documentation

Learning about Unicode

The string/Unicode rabbit hole goes deep; we have a good overview (scroll to the Unicode section).

unicode's People

Contributors

lukewilliamboswell, dilsonhiga, anton-4, rtfeldman, ricardo-valero, ageron, bhansconnect

unicode's Issues

Implement Visual Width

The Unicode Character Database (UCD) assigns each Unicode character, as its default width property, one of six values: Ambiguous, Fullwidth, Halfwidth, Narrow, Wide, or Neutral (= Not East Asian). For any given operation, these six default property values resolve into only two property values, narrow and wide, depending on context.

zulip discussion

We already have a few examples that do this in our package, so this should be easy to implement as a good first issue.

Add the EastAsianWidth.txt data file to unicode/package/data, then write an InternalEAWGen.roc file, almost a copy-paste of InternalGBPGen.roc, that parses the data file and generates a Roc file mapping each code point (CP) to an East Asian Width property, EAW : [Ambiguous, Fullwidth, Halfwidth, Narrow, Neutral, Wide]. Then implement a corresponding helper that uses this mapping to walk through a List U8 or a Str and sum the widths.
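
To make the intended behaviour of the width helper concrete, here is a minimal sketch in Python rather than Roc, since Python's standard library already exposes the East Asian Width property through unicodedata.east_asian_width. The function name visual_width is hypothetical, and resolving Ambiguous to narrow is an assumption (the resolution depends on context, as described above). It is not the package's implementation and ignores zero-width characters such as combining marks.

```python
import unicodedata

# Assumption: Wide ("W") and Fullwidth ("F") occupy two columns; every other
# East Asian Width class, including Ambiguous ("A"), is treated as one column.
WIDE_CLASSES = {"W", "F"}


def visual_width(text: str) -> int:
    """Sum the per-code-point column widths of `text` (hypothetical helper)."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in WIDE_CLASSES else 1
        for ch in text
    )
```

A Roc version would follow the same shape: look up the generated EAW value for each code point and fold the widths into a running total.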

Grapheme.split fails on empty strings

Here's a little example to reproduce the issue:

app [main] {
    pf: platform "https://github.com/roc-lang/basic-cli/releases/download/0.12.0/Lb8EgiejTUzbggO2HVVuPJFkwvvsfW6LojkLR20kTVE.tar.br",
    unicode: "https://github.com/roc-lang/unicode/releases/download/0.1.1/-FQDoegpSfMS-a7B0noOnZQs3-A2aq9RSOR5VVLMePg.tar.br"
}

import pf.Task exposing [Task]
import pf.Stdout
import unicode.Grapheme

main =
    when Grapheme.split "" is
        Ok _ -> Stdout.line!("Ok")
        Err _ -> Stdout.line!("Err")

Improve Grapheme.split testing

Quoted from Luke:

coverage of the unicode data file test points is pretty average, like it might only have a test that covers an emoji at the start of a string, but not the middle or end or before a CRLF or after a Hangul sequence... etc.
So I'm reasonably confident there are a couple of edge cases we haven't caught, and could end up crashing someone's code. It would be nice to get that to a point where we are reasonably confident that is not going to happen.

Grapheme.split function crashes

The Grapheme.split function crashes on some edge cases. For example, running:

Grapheme.split (Str.fromUtf8 [225, 134, 168, 226, 128, 141, 225, 133, 129])

Crashes with the output:

The program crashed with:

        This is definitely a bug in the roc-lang/unicode package, caused by an unhandled edge case in grapheme text segmentation.

It is difficult to track down and catch every possible combination, so it would be helpful if you could log this as an issue with a reproduction.

Grapheme.split state machine state at the time was:
((AfterZWJ <opaque>), [8205, 4417], [ZWJ, L])

Here is the call stack that led to the crash:

        roc.panic
        Grapheme.splitHelp
        Grapheme.(anonymous function)
        Result.try
        Grapheme.split
        app.(anonymous function)
        Task.(anonymous function)
        .(anonymous function)
        rust.main

Optimizations can make this list inaccurate! If it looks wrong, try running without `--optimize` and with `--linker=legacy`

Here is a list of inputs that crash this function:

Grapheme.split (Str.fromUtf8 [13, 204, 136, 225, 134, 168, 226, 128, 141, 234, 176, 129])
Grapheme.split (Str.fromUtf8 [224, 185, 131, 1, 225, 133, 160, 226, 128, 141, 224, 164, 128])
Grapheme.split (Str.fromUtf8 [225, 132, 128, 226, 128, 141, 204, 136, 224, 165, 141])
Grapheme.split (Str.fromUtf8 [225, 132, 128, 226, 128, 141, 204, 136, 31])
Grapheme.split (Str.fromUtf8 [225, 133, 160, 226, 128, 141, 204, 136, 205, 184])
Grapheme.split (Str.fromUtf8 [225, 133, 160, 226, 128, 141, 224, 164, 149])
Grapheme.split (Str.fromUtf8 [225, 134, 168, 226, 128, 141, 10])
Grapheme.split (Str.fromUtf8 [225, 134, 168, 226, 128, 141, 225, 133, 129])
Grapheme.split (Str.fromUtf8 [234, 176, 128, 226, 128, 141, 224, 165, 141])
Grapheme.split (Str.fromUtf8 [234, 176, 128, 226, 128, 141, 224, 181, 142])
Grapheme.split (Str.fromUtf8 [234, 176, 129, 226, 128, 141, 204, 136, 240, 159, 135, 166])
Grapheme.split (Str.fromUtf8 [234, 176, 129, 226, 128, 141, 225, 134, 168])
Grapheme.split (Str.fromUtf8 [234, 176, 129, 226, 128, 141, 36])
Grapheme.split (Str.fromUtf8 [243, 160, 129, 174, 234, 176, 128, 226, 128, 141, 224, 164, 188])

They all contain U+200D, the zero-width joiner (ZWJ) character, which appears in UTF-8 as the bytes 226, 128, 141, so that's probably the source of the crash.
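
The ZWJ observation can be checked mechanically. The sketch below is Python rather than Roc and uses only the standard library; the names CRASHING_INPUTS, code_points, and all_contain_zwj are hypothetical. It decodes each byte list from the list above and confirms that U+200D is among the resulting code points.

```python
ZWJ = 0x200D  # zero-width joiner

# Byte sequences copied from the crashing examples above.
CRASHING_INPUTS = [
    [13, 204, 136, 225, 134, 168, 226, 128, 141, 234, 176, 129],
    [224, 185, 131, 1, 225, 133, 160, 226, 128, 141, 224, 164, 128],
    [225, 132, 128, 226, 128, 141, 204, 136, 224, 165, 141],
    [225, 132, 128, 226, 128, 141, 204, 136, 31],
    [225, 133, 160, 226, 128, 141, 204, 136, 205, 184],
    [225, 133, 160, 226, 128, 141, 224, 164, 149],
    [225, 134, 168, 226, 128, 141, 10],
    [225, 134, 168, 226, 128, 141, 225, 133, 129],
    [234, 176, 128, 226, 128, 141, 224, 165, 141],
    [234, 176, 128, 226, 128, 141, 224, 181, 142],
    [234, 176, 129, 226, 128, 141, 204, 136, 240, 159, 135, 166],
    [234, 176, 129, 226, 128, 141, 225, 134, 168],
    [234, 176, 129, 226, 128, 141, 36],
    [243, 160, 129, 174, 234, 176, 128, 226, 128, 141, 224, 164, 188],
]


def code_points(utf8_bytes: list[int]) -> list[int]:
    """Decode a list of UTF-8 byte values into a list of code points."""
    return [ord(ch) for ch in bytes(utf8_bytes).decode("utf-8")]


def all_contain_zwj(inputs: list[list[int]]) -> bool:
    """Return True if every input decodes to a string containing U+200D."""
    return all(ZWJ in code_points(seq) for seq in inputs)
```

For the example shown with the crash output, code_points yields 0x11A8, 0x200D, 0x1141; the last two match the [8205, 4417] pair in the reported state machine state.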

These examples were found by running the radamsa fuzzer using the examples in the GraphemeBreakTest data file. Hopefully this fuzz testing could be automated in the future as mentioned in #7.
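
As a starting point for automating this, the sketch below (Python, standard library only; parse_break_test_line is a hypothetical name) parses one line of the GraphemeBreakTest data file into the expected grapheme clusters. It assumes the usual line layout of that file: hex code points separated by ÷ (break allowed) or × (no break), followed by a # comment. A fuzzing harness could use such parsed seeds both as input for mutation and as expected output for unmutated cases.

```python
def parse_break_test_line(line: str) -> list[list[int]]:
    """Parse one GraphemeBreakTest line into a list of clusters.

    Each cluster is a list of code points that must stay together.
    Returns an empty list for blank or comment-only lines.
    """
    # Drop the trailing comment, which may itself contain the marker symbols.
    data = line.split("#", 1)[0].strip()
    clusters: list[list[int]] = []
    current: list[int] = []
    for token in data.split():
        if token == "÷":
            # A break closes the cluster collected so far, if any.
            if current:
                clusters.append(current)
                current = []
        elif token == "×":
            # No break: the next code point joins the current cluster.
            continue
        else:
            current.append(int(token, 16))
    if current:
        clusters.append(current)
    return clusters
```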
