
Add Unicode Support (heh issue, closed)

ndd7xv commented on May 27, 2024
Add Unicode Support

from heh.

Comments (1)

0b11001111 commented on May 27, 2024

I used the weekend to dive a bit into the topic and it turns out this is a bit more complicated than ASCII support -- surprise :D

A few things I've noticed:

  • Unicode is not itself an encoding. There are different encodings for Unicode, such as UTF-8 and UTF-16
  • Both are variable-length encodings, which means one character may occupy more than one byte (and in UTF-16 always does)
  • This makes the mapping of byte x to character y non-trivial, unless you decode character by character
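
To make the byte counts concrete, here is a small sketch using only the Rust standard library. The helper names `encoded_sizes` and `byte_offsets` are made up for illustration and are not part of heh.

```rust
/// Bytes needed to store `c` in UTF-8 and in UTF-16.
/// (Hypothetical helper, for illustration only.)
fn encoded_sizes(c: char) -> (usize, usize) {
    // len_utf16() counts 16-bit code units, so multiply by 2 to get bytes.
    (c.len_utf8(), c.len_utf16() * 2)
}

/// Byte offset at which each character of a UTF-8 string starts.
/// (Hypothetical helper, for illustration only.)
fn byte_offsets(s: &str) -> Vec<(usize, char)> {
    s.char_indices().collect()
}
```

For example, `encoded_sizes('a')` gives `(1, 2)` and `encoded_sizes('💩')` gives `(4, 4)`, while `byte_offsets("a💩b")` gives `[(0, 'a'), (1, '💩'), (5, 'b')]`: the byte at offset 3 belongs to a character but is not the start of one.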

I came up with two basic approaches to tackle the issue.

  1. Use something like String::from_utf8_lossy, put the whole line/paragraph into it, and accept the ragged margin you get as a result, as well as decoding errors at the endpoints of your byte slice. I haven't really pursued this idea, but I don't think it's promising.
  2. Decode character by character and here's how
    1. Given a slice of bytes, try parsing them from the start using std::str::from_utf8
      • on success: cool, you've just decoded the whole slice
      • on failure: the error contains information about how many bytes were parsed successfully before the error happened. We yield the successfully parsed substring along with its offset in the slice.
      • move the offset past the parsed bytes and the offending byte, then repeat the procedure
    2. Now, we got a stream of successfully parsed substrings and their byte offsets. This can be turned into a stream of characters, each corresponding to a byte in the original slice.
      • a character in a substring simply becomes a character in the stream
        • if a character occupies more than one byte, it will be followed by size-1 dummy characters (I need good ideas for which one to use here; currently it is '•')
      • bytes outside the successfully parsed substrings get represented by a dummy character (just like now)
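
The second approach can be sketched roughly as follows. This is a minimal illustration of the steps described above, not the code from the actual branch: the function names `valid_chunks` and `char_per_byte` and the `DUMMY` constant are assumptions, and only `std::str::from_utf8` and `Utf8Error::valid_up_to` come from the standard library.

```rust
use std::str;

/// Placeholder shown for continuation bytes and undecodable bytes
/// (the comment above mentions '•' as the current choice).
const DUMMY: char = '•';

/// Step 1: split `bytes` into valid UTF-8 substrings with their byte offsets.
fn valid_chunks(bytes: &[u8]) -> Vec<(usize, &str)> {
    let mut chunks = Vec::new();
    let mut offset = 0;
    while offset < bytes.len() {
        match str::from_utf8(&bytes[offset..]) {
            Ok(s) => {
                // The whole remainder decoded successfully.
                chunks.push((offset, s));
                break;
            }
            Err(e) => {
                let valid = e.valid_up_to();
                if valid > 0 {
                    // This prefix was reported as valid, so re-parsing cannot fail.
                    let s = str::from_utf8(&bytes[offset..offset + valid]).unwrap();
                    chunks.push((offset, s));
                }
                // Skip the valid prefix plus one offending byte, then retry.
                offset += valid + 1;
            }
        }
    }
    chunks
}

/// Step 2: produce exactly one display character per input byte.
fn char_per_byte(bytes: &[u8]) -> Vec<char> {
    // Start with dummies everywhere; decoded characters overwrite their first byte.
    let mut out = vec![DUMMY; bytes.len()];
    for (offset, s) in valid_chunks(bytes) {
        for (i, c) in s.char_indices() {
            // The remaining len_utf8() - 1 slots of `c` keep the dummy.
            out[offset + i] = c;
        }
    }
    out
}
```

For the input `[b'a', 0xFF, 0xF0, 0x9F, 0x92, 0xA9, b'b']`, `char_per_byte` returns `['a', '•', '💩', '•', '•', '•', 'b']`: one slot per byte, with the invalid `0xFF` and the three trailing bytes of 💩 shown as dummies. Control characters are passed through unchanged in this sketch and would still need separate handling.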

The latter sort of works but before seriously considering it, a few problems have to be sorted out:

  • Monospaced fonts work great for ASCII but 💩 looks wider in my terminal (and others may appear narrower)
  • There is a lot of weird stuff in Unicode, especially further control codes, that needs to be tested and handled
  • How to deal with valid Unicode that cannot be displayed by your font?
  • How to reliably detect whether the terminal in use supports Unicode
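
On the last point, one common heuristic (not a reliable test) is to look at the POSIX locale environment variables, where `LC_ALL` takes precedence over `LC_CTYPE`, which takes precedence over `LANG`. The sketch below assumes a Unix-like environment; both function names are hypothetical, and a locale that claims UTF-8 says nothing about whether the font can render a given character.

```rust
/// Heuristic: does the effective locale setting mention UTF-8?
/// The first non-empty value wins, in POSIX precedence order.
fn locale_claims_utf8(lc_all: Option<&str>, lc_ctype: Option<&str>, lang: Option<&str>) -> bool {
    for value in [lc_all, lc_ctype, lang] {
        if let Some(v) = value {
            if !v.is_empty() {
                let v = v.to_ascii_lowercase();
                return v.contains("utf-8") || v.contains("utf8");
            }
        }
    }
    // Nothing set: assume no Unicode support.
    false
}

/// Reads the variables from the real environment and applies the heuristic.
fn terminal_probably_supports_unicode() -> bool {
    let get = |name: &str| std::env::var(name).ok();
    let (all, ctype, lang) = (get("LC_ALL"), get("LC_CTYPE"), get("LANG"));
    locale_claims_utf8(all.as_deref(), ctype.as_deref(), lang.as_deref())
}
```

Keeping the decision logic separate from the environment lookup makes it easy to test with fixed values.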

Here's a screenshot of my progress:
[screenshot]
(On the left: my modified version of heh, note the overflow in line one! On the right: original heh.)

If I find the time I'll polish the current state a bit and push it :) See #36
