UTF8 support? (wren, closed)

pwoolcoc avatar pwoolcoc commented on July 28, 2024
UTF8 support?

from wren.

Comments (16)

ryanplusplus avatar ryanplusplus commented on July 28, 2024

FYI Lua v5.3 (currently in RC) has "basic" support for utf8:
http://www.lua.org/work/doc/manual.html

munificent avatar munificent commented on July 28, 2024

Yes! My intent has always been for Wren to be based on UTF-8. I've taken a few baby steps there, but there's still a lot of work to do. I think it handles Wren source files being UTF-8 correctly, but the various string operations at runtime still need work.

The main tricky bit is, of course, handling indexing into a string when it uses UTF-8. We'll have to figure out what behavior we think users will want and how efficiently it can be implemented.
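
To make the indexing trade-off concrete, here is a minimal Python sketch (not Wren's implementation). The helper name `code_point_start` is hypothetical. It contrasts an O(1) byte subscript, which can land in the middle of a multi-byte character, with an O(n) scan that finds where the n-th code point begins.

```python
def code_point_start(data: bytes, n: int) -> int:
    """Return the byte offset where the n-th code point (0-based) begins."""
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte starts a code point.
        if byte & 0xC0 != 0x80:
            if seen == n:
                return offset
            seen += 1
    raise IndexError("code point index out of range")


text = "na\u00efve".encode("utf-8")  # U+00EF takes two bytes in UTF-8
byte_at_3 = text[3]                   # O(1), but this is the second byte of U+00EF
offset_of_cp_3 = code_point_start(text, 3)  # O(n) scan; 'v' starts at byte 4
```

Either behavior is defensible; the cost difference is the part that matters for the design.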

pwoolcoc avatar pwoolcoc commented on July 28, 2024

how averse are you to dependencies? ICU is way too much, but there are smaller projects we could use (https://github.com/josephg/librope) that we might be able to leverage for help with string operations.

edsrzf avatar edsrzf commented on July 28, 2024

The lexer doesn't currently allow UTF-8 identifiers, but that may be by design; it's unclear.

munificent avatar munificent commented on July 28, 2024

> how averse are you to dependencies?

For better or worse, very averse. I'm not opposed to them in general. I love code reuse. But Wren's charter is to be as minimal, lightweight, and easy to drop into a codebase as possible. Part of that means minimal dependencies.

ICU is probably 100x bigger than all of Wren. :)

I'm not planning to have the core library support any complex Unicode functionality (collation, etc.). For that stuff, users are better off bringing that functionality in themselves. My goal is just to make sure Wren's internal string operations can handle storing Unicode text, and that the operations that are provided don't do something dumb on non-ASCII strings.

I don't think an optimized rope implementation is needed either (though I do think ropes are super cool). Users could always roll their own in Wren if needed.

munificent avatar munificent commented on July 28, 2024

> The lexer doesn't currently allow UTF-8 identifiers, but that may be by design; it's unclear.

By design. From what I've heard from Java and JS folks, Unicode identifiers ended up being a security and maintainability headache. For what it's worth, Ruby only allows ASCII letters in identifiers and Matz and company are Japanese.

Lua apparently uses the current locale to decide which identifiers are valid, which I think means code may not be portable across machines. 😮

munificent avatar munificent commented on July 28, 2024

There's still work left to do, but I made a bunch of progress on this:

  • Non-ASCII characters are allowed in string literals: c5e6795
  • The existing string methods that already work with UTF-8 (this is one of the really smart things about UTF-8) are now tested: a92e58c
  • String subscripting with a number index handles UTF-8 correctly: a5b00ce
  • Strings are iterable and iterate over their code points, not bytes: eb424f5
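
As an illustration of iterating code points rather than bytes, here is a rough Python sketch of a UTF-8 decoder. It is not the code from the commits above: the function name `iter_code_points` is hypothetical, and the sketch assumes well-formed input (it does not reject overlong or truncated sequences).

```python
from typing import Iterator


def iter_code_points(data: bytes) -> Iterator[int]:
    """Yield code points from UTF-8 bytes (assumes well-formed input)."""
    i = 0
    while i < len(data):
        lead = data[i]
        # The high bits of the lead byte give the sequence length.
        if lead < 0x80:
            length, value = 1, lead
        elif lead >> 5 == 0b110:
            length, value = 2, lead & 0x1F
        elif lead >> 4 == 0b1110:
            length, value = 3, lead & 0x0F
        elif lead >> 3 == 0b11110:
            length, value = 4, lead & 0x07
        else:
            raise ValueError(f"unexpected byte 0x{lead:02x} at offset {i}")
        # Each continuation byte contributes its low six bits.
        for cont in data[i + 1 : i + length]:
            value = (value << 6) | (cont & 0x3F)
        yield value
        i += length


sample = "h\u00e9\u20ac\U0001F600".encode("utf-8")  # 1-, 2-, 3- and 4-byte characters
code_points = list(iter_code_points(sample))
```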

The missing pieces I know of are:

  • Subscripting a string with a range isn't UTF-8 savvy. This just needs some grunt work to fix. Lots of corner cases to handle.
  • The count getter on string is unclear. I think it should be split into countBytes (number of bytes) and countCodePoints (ugh, terrible names). The latter would be O(n) since it has to walk the string.
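
A small Python sketch of why the two counts have different costs. The snake_case names are stand-ins for the proposed countBytes and countCodePoints, not an existing API. The byte count is just the stored length, while the code point count has to look at every byte and skip continuation bytes.

```python
def count_bytes(data: bytes) -> int:
    # O(1): the buffer already knows its length.
    return len(data)


def count_code_points(data: bytes) -> int:
    # O(n): count every byte that is not a continuation byte (0b10xxxxxx).
    return sum(1 for byte in data if byte & 0xC0 != 0x80)


sample = "h\u00e9llo".encode("utf-8")  # U+00E9 occupies two bytes
n_bytes = count_bytes(sample)
n_code_points = count_code_points(sample)
```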

In general, users should be dissuaded from thinking about a string's "length". It probably doesn't mean anything practically useful most of the time. Instead, they should use the higher level methods on string whenever possible (iterating, indexOf, startsWith, etc.).
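
The reason methods like indexOf and startsWith need no special UTF-8 handling can be shown with plain byte search. In this Python sketch, `bytes.find` and `bytes.startswith` stand in for the Wren methods. Lead bytes and continuation bytes occupy disjoint ranges, so a well-formed needle can only match at a character boundary.

```python
haystack = "caf\u00e9 \u2615 wren".encode("utf-8")
needle = "\u2615".encode("utf-8")  # a three-byte character

# Plain byte-wise search, with no decoding involved.
offset = haystack.find(needle)
has_prefix = haystack.startswith("caf\u00e9".encode("utf-8"))

# The match begins on a lead byte, never on a continuation byte.
match_is_on_boundary = haystack[offset] & 0xC0 != 0x80
```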

kmarekspartz avatar kmarekspartz commented on July 28, 2024

This is a size issue, but couldn't a string hold an array of pointers to code points? Then the countCodePoints is the size of that array, and subscripting can be done by code point.

It may be useful to split String into ByteString and String classes...

MarcoLizza avatar MarcoLizza commented on July 28, 2024

Could the count getter simply be split into size (in bytes) and length (in code-points)?

Getting the length and/or size of the strings is crucial in I/O (over network sockets, for example).

pwoolcoc avatar pwoolcoc commented on July 28, 2024

ah, @munificent, you beat me to it! I have a branch almost done with almost everything you just committed. If you haven't started the Range subscripting yet, I volunteer to take that on.

kmarekspartz avatar kmarekspartz commented on July 28, 2024

> Could the count getter simply be split into size (in bytes) and length (in code-points)?

Those names might be ambiguous.

MarcoLizza avatar MarcoLizza commented on July 28, 2024

> Those names might be ambiguous.

I agree that, perhaps, count may be a better choice over length (also for consistency). But size sounds quite unambiguous, to me.

munificent avatar munificent commented on July 28, 2024

> This is a size issue, but couldn't a string hold an array of pointers to code points? Then the countCodePoints is the size of that array, and subscripting can be done by code point.

Pointers are (at least) 32 bits, so you'd be better off just using UTF-32 at that point. The problem is that that's a horrifically inefficient encoding for most real-world strings. A very large fraction of strings in programs are ASCII. This is true even in programs written for users of other languages, since many strings contain things like IDs and other "internal" stuff that are never shown to humans. Outside of that, the vast majority of strings fit in UTF-16. Until someone starts writing Wren programs that deal with Linear B or hieroglyphics, we'd never need more than 16 bits per character. So allocating 32 bits all the time is just super painful. It wastes memory and it slows things down because it increases cache misses.

UTF-8 is, I think, the best compromise. It's optimally small and fast for most strings, quite small and fast for strings in modern languages, and it doesn't fail under the full weight of Unicode.
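
The size argument is easy to check with a short Python sketch. The sample strings are arbitrary choices for illustration: an ASCII identifier, a Japanese greeting, and a Linear B syllable (U+10000) that lies outside the 16-bit range.

```python
def encoded_sizes(text: str) -> dict:
    """Byte length of `text` in each encoding (no byte-order mark)."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


ascii_id = encoded_sizes("user_id_42")
japanese = encoded_sizes("\u3053\u3093\u306b\u3061\u306f")
linear_b = encoded_sizes("\U00010000")

for label, sizes in (("ascii", ascii_id), ("japanese", japanese), ("linear b", linear_b)):
    print(label, sizes)
```

For the ASCII string, UTF-32 costs four times as much as UTF-8. For the Japanese string, UTF-8 is somewhat larger than UTF-16 but still smaller than UTF-32.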

The only thing you lose is direct indexing, but I think in practice that doesn't hurt much. For what it's worth, the approach I took here is exactly what Go does, and those guys have thought a lot about this (including being the ones to invent UTF-8 many moons ago).

> Could the count getter simply be split into size (in bytes) and length (in code-points)?

That was my first thought. I too think size pretty naturally sounds like "in bytes". But I think this is too likely to confuse people. I want the names to be really unambiguous.

Also, by making the names longer and a bit more awkward to use, it discourages people from thinking about the length of their string, which is good. It's rare that solving a problem with strings should require thinking about their length.

> ah, @munificent, you beat me to it! I have a branch almost done with almost everything you just committed.

Oh, crap! I'm so sorry. I was in the shower thinking about how I wanted to handle UTF-8 and I felt like everything clicked so I wanted to get it all implemented before I forgot. Now that I have more contributors (woo!), I need to think about coordinating more!

> If you haven't started the Range subscripting yet, I volunteer to take that on.

Yes please!

Speaking of Go, Go also allows strings to be used as arbitrary byte buffers. That means they can contain any byte value, including zero and malformed UTF-8 sequences. That seems pretty useful to me. Now that strings internally store their length (thanks, @edsrzf!) we don't need them to be null-terminated. Something to consider.

edsrzf avatar edsrzf commented on July 28, 2024

I considered removing the null terminator from strings, but decided against it since keeping it makes C interoperability easier.
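
Both points can coexist, as this Python sketch suggests. The `LenString` class and the `c_strlen` helper are hypothetical illustrations, not Wren's actual string object. Storing the length lets the payload contain zero bytes, while a trailing terminator keeps the buffer usable by C code that expects one. The caveat is that any strlen-style scan stops at the first embedded zero.

```python
class LenString:
    """A byte string that stores its length and also keeps a trailing NUL."""

    def __init__(self, payload: bytes):
        self.length = len(payload)        # authoritative length
        self.buffer = payload + b"\x00"   # terminator kept for C callers

    def value(self) -> bytes:
        # Use the stored length, not the terminator, to recover the payload.
        return self.buffer[: self.length]


def c_strlen(buffer: bytes) -> int:
    """Mimic C's strlen: count bytes up to the first zero byte."""
    return buffer.index(0)


plain = LenString(b"wren")
binary = LenString(b"ab\x00cd")           # embedded zero byte

plain_scan = c_strlen(plain.buffer)       # agrees with the stored length
binary_scan = c_strlen(binary.buffer)     # stops early at the embedded zero
```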

munificent avatar munificent commented on July 28, 2024

One step closer! String subscripting with ranges works again and is UTF-8 savvy. (To slice a range of raw bytes from a string, we'll add range support to the subscript operator on string.bytes.)
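
The thread does not spell out the exact range semantics, so the following Python sketch rests on an assumption: the range counts code points, with an exclusive end. The helper names are hypothetical. It shows the difference between a boundary-aware slice and a raw byte slice, which can cut a character in half.

```python
from typing import List


def code_point_offsets(data: bytes) -> List[int]:
    """Byte offsets where each code point starts, plus the end offset."""
    offsets = [i for i, byte in enumerate(data) if byte & 0xC0 != 0x80]
    offsets.append(len(data))
    return offsets


def slice_code_points(data: bytes, start: int, stop: int) -> bytes:
    """Return code points start..stop (stop exclusive) as UTF-8 bytes."""
    offsets = code_point_offsets(data)
    count = len(offsets) - 1
    if not 0 <= start <= stop <= count:
        raise IndexError("code point range out of bounds")
    return data[offsets[start] : offsets[stop]]


text = "a\u00e9\u20acz".encode("utf-8")   # 1 + 2 + 3 + 1 = 7 bytes
middle = slice_code_points(text, 1, 3)    # the two non-ASCII characters
raw = text[2:4]                           # raw byte slice that splits both characters
```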

The last piece is fixing the length/count getters.

munificent avatar munificent commented on July 28, 2024

OK, I think I have the count methods and the overall API figured out. See: fe14364
