Comments (16)
FYI Lua v5.3 (currently in RC) has "basic" support for utf8:
http://www.lua.org/work/doc/manual.html
from wren.
Yes! My intent has always been for Wren to be based on UTF-8. I've taken a few baby steps there, but there's still a lot of work to do. I think it handles Wren source files being UTF-8 correctly, but the various string operations at runtime still need work.
The main tricky bit is, of course, handling indexing into a string when it uses UTF-8. We'll have to figure out what behavior we think users will want and how efficiently it can be implemented.
from wren.
how averse are you to dependencies? ICU is way too much, but there are smaller projects we could use (https://github.com/josephg/librope) that we might be able to leverage for help with string operations.
from wren.
The lexer doesn't currently allow UTF-8 identifiers, but that may be by design; it's unclear.
from wren.
> how averse are you to dependencies?
For better or worse, very averse. I'm not opposed to them in general. I love code reuse. But Wren's charter is to be as minimal, lightweight, and easy to drop into a codebase as possible. Part of that means minimal dependencies.
ICU is probably 100x bigger than all of Wren. :)
I'm not planning to have the core library support any complex Unicode functionality (collation, etc.). For that stuff, users are better off bringing that functionality in themselves. My goal is just to make sure Wren's internal string operations can handle storing Unicode text, and that the operations that are provided don't do something dumb on non-ASCII strings.
I don't think an optimized rope implementation is needed either (though I do think ropes are super cool). Users could always roll their own in Wren if needed.
from wren.
> The lexer doesn't currently allow UTF-8 identifiers, but that may be by design; it's unclear.
By design. From what I've heard from Java and JS folks, Unicode identifiers ended up being a security and maintainability headache. For what it's worth, Ruby only allows ASCII letters in identifiers and Matz and company are Japanese.
Lua apparently uses the current locale to decide which identifiers are valid, which I think means code may not be portable across machines. 😮
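For illustration, here is a rough sketch of what an ASCII-only identifier check could look like in C. The function names are hypothetical and are not taken from Wren's lexer. The point is only that explicit range comparisons give the same answer on every machine, whereas `isalpha()` from `<ctype.h>` consults the current locale.

```c
#include <stdbool.h>
#include <stddef.h>

// Hypothetical helper: true only for ASCII letters and underscore.
// Explicit ranges are used instead of isalpha(), so the result never
// depends on the locale.
bool is_name_start(unsigned char c) {
  return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}

// Hypothetical helper: identifier characters after the first one may
// also be ASCII digits.
bool is_name_char(unsigned char c) {
  return is_name_start(c) || (c >= '0' && c <= '9');
}

// Returns true if the first `length` bytes of `text` form an ASCII-only
// identifier. Any byte >= 0x80 (part of a multi-byte UTF-8 sequence)
// causes a rejection.
bool is_ascii_identifier(const char* text, size_t length) {
  if (length == 0 || !is_name_start((unsigned char)text[0])) return false;
  for (size_t i = 1; i < length; i++) {
    if (!is_name_char((unsigned char)text[i])) return false;
  }
  return true;
}
```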
from wren.
There's still work left to do, but I made a bunch of progress on this:
- Non-ASCII characters are allowed in string literals: c5e6795
- The existing string methods that already work with UTF-8 (this is one of the really smart things about UTF-8) are now tested: a92e58c
- String subscripting with a number index handles UTF-8 correctly: a5b00ce
- Strings are iterable and iterate over their code points, not bytes: eb424f5
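To make "iterate over code points, not bytes" concrete, here is a minimal sketch of a UTF-8 decoder and a loop that walks a buffer one code point at a time. It is not the code from the commits above: `utf8_decode` and `utf8_collect` are made-up names, and the sketch assumes a malformed byte should be reported as -1 and skipped one byte at a time. It also does not reject overlong encodings or surrogates.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: decodes the code point that starts at bytes[0].
// Stores the number of bytes consumed in *size. Returns -1 and consumes
// a single byte if the sequence is malformed or truncated.
int32_t utf8_decode(const uint8_t* bytes, size_t length, size_t* size) {
  *size = 1;
  if (length == 0) {
    *size = 0;
    return -1;
  }

  uint8_t first = bytes[0];
  if (first < 0x80) return first;  // Single-byte (ASCII) character.

  int32_t value;
  size_t needed;
  if ((first & 0xE0) == 0xC0) {
    value = first & 0x1F;  // 110xxxxx: two-byte sequence.
    needed = 2;
  } else if ((first & 0xF0) == 0xE0) {
    value = first & 0x0F;  // 1110xxxx: three-byte sequence.
    needed = 3;
  } else if ((first & 0xF8) == 0xF0) {
    value = first & 0x07;  // 11110xxx: four-byte sequence.
    needed = 4;
  } else {
    return -1;  // Stray continuation byte or invalid lead byte.
  }

  if (needed > length) return -1;  // Sequence runs past the buffer.

  for (size_t i = 1; i < needed; i++) {
    if ((bytes[i] & 0xC0) != 0x80) return -1;  // Not 10xxxxxx.
    value = (value << 6) | (bytes[i] & 0x3F);
  }

  *size = needed;
  return value;
}

// Walks `text` and writes up to `max` decoded code points into `out`.
// Returns how many values were written.
size_t utf8_collect(const char* text, size_t length, int32_t* out,
                    size_t max) {
  const uint8_t* bytes = (const uint8_t*)text;
  size_t count = 0;
  size_t i = 0;
  while (i < length && count < max) {
    size_t size;
    out[count++] = utf8_decode(bytes + i, length - i, &size);
    i += size;
  }
  return count;
}
```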
The missing pieces I know of are:
- Subscripting a string with a range isn't UTF-8 savvy. This just needs some grunt work to fix. Lots of corner cases to handle.
- The `count` getter on string is unclear. I think it should be split into `countBytes` (number of bytes) and `countCodePoints` (ugh, terrible names). The latter would be O(n) since it has to walk the string.
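A small sketch of why counting code points is O(n) while counting bytes is free once the length is stored: the code-point count has to look at every byte. The function name is hypothetical, and the sketch assumes the input is well-formed UTF-8.

```c
#include <stddef.h>

// Hypothetical counterpart of a code-point count: walks every byte and
// counts the ones that start a code point. Continuation bytes have the
// bit pattern 10xxxxxx; any other byte begins a new code point.
// Assumes `text` holds well-formed UTF-8.
size_t count_code_points(const char* text, size_t byte_count) {
  size_t count = 0;
  for (size_t i = 0; i < byte_count; i++) {
    if (((unsigned char)text[i] & 0xC0) != 0x80) count++;
  }
  return count;
}
```

The byte count, by contrast, is just the stored `byte_count` argument and needs no loop.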
In general, users should be dissuaded from thinking about a string's "length". It probably doesn't mean anything practically useful most of the time. Instead, they should use the higher level methods on string whenever possible (iterating, `indexOf`, `startsWith`, etc.)
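As a sketch of why methods like these need no UTF-8 awareness: lead bytes and continuation bytes use disjoint bit patterns, so when both the text and the needle are well-formed UTF-8, a plain byte comparison cannot match in the middle of a character. The function names below are hypothetical C stand-ins, not Wren's implementation, and `index_of` returns a byte offset.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// Hypothetical stand-in for a startsWith method: a plain byte compare.
bool starts_with(const char* text, size_t text_len,
                 const char* prefix, size_t prefix_len) {
  return prefix_len <= text_len && memcmp(text, prefix, prefix_len) == 0;
}

// Hypothetical stand-in for an indexOf method: a naive byte-wise search.
// Returns the byte offset of the first match, or -1 if there is none.
ptrdiff_t index_of(const char* text, size_t text_len,
                   const char* needle, size_t needle_len) {
  if (needle_len > text_len) return -1;
  for (size_t i = 0; i + needle_len <= text_len; i++) {
    if (memcmp(text + i, needle, needle_len) == 0) return (ptrdiff_t)i;
  }
  return -1;
}
```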
from wren.
This is a size issue, but couldn't a string hold an array of pointers to code points? Then the countCodePoints is the size of that array, and subscripting can be done by code point.
It may be useful to split String into ByteString and String classes...
from wren.
Could the `count` getter simply be split into `size` (in bytes) and `length` (in code-points)?
Getting the length and/or size of the strings is crucial in I/O (over network sockets, for example).
from wren.
ah, @munificent, you beat me to it! I have a branch almost done with almost everything you just committed. If you haven't started the Range subscripting yet, I volunteer to take that on.
from wren.
> Could the count getter simply be split into size (in bytes) and length (in code-points)?
Those names might be ambiguous.
from wren.
> Those names might be ambiguous.
I agree that, perhaps, `count` may be a better choice over `length` (also for consistency). But `size` sounds quite unambiguous, to me.
from wren.
> This is a size issue, but couldn't a string hold an array of pointers to code points? Then the countCodePoints is the size of that array, and subscripting can be done by code point.
Pointers are (at least) 32 bits, so you'd be better off just using UTF-32 at that point. The problem is that that's a horrifically inefficient encoding for most real-world strings. A very large fraction of strings in programs are ASCII. This is true even in programs written for users of other languages, since many strings contain things like IDs and other "internal" stuff that are never shown to humans. Outside of that, the vast majority of strings fit in UTF-16. Until someone starts writing Wren programs that deal with Linear B or hieroglyphics, we'd never need more than 16 bits per character. So allocating 32 bits all the time is just super painful. It wastes memory and it slows things down because it increases cache misses.
UTF-8 is, I think, the best compromise. It's optimally small and fast for most strings, quite small and fast for strings in modern languages, and doesn't fail under the full weight of Unicode.
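To put rough numbers on the size argument, here is a small sketch that computes how many bytes a sequence of code points needs in UTF-8 versus a fixed four bytes each in UTF-32. The function names are made up for illustration, and the inputs are assumed to be valid Unicode code points.

```c
#include <stddef.h>
#include <stdint.h>

// Number of bytes UTF-8 needs for one code point.
// Assumes `cp` is a valid code point (at most 0x10FFFF).
size_t utf8_encoded_size(uint32_t cp) {
  if (cp < 0x80) return 1;     // ASCII.
  if (cp < 0x800) return 2;    // Latin supplements, Greek, Cyrillic, ...
  if (cp < 0x10000) return 3;  // Rest of the Basic Multilingual Plane.
  return 4;                    // Supplementary planes.
}

// Total UTF-8 size in bytes of `count` code points.
size_t utf8_total_size(const uint32_t* code_points, size_t count) {
  size_t total = 0;
  for (size_t i = 0; i < count; i++) {
    total += utf8_encoded_size(code_points[i]);
  }
  return total;
}

// UTF-32 always spends four bytes per code point.
size_t utf32_total_size(size_t count) {
  return count * sizeof(uint32_t);
}
```

For a five-character ASCII identifier, this gives 5 bytes in UTF-8 against 20 bytes in UTF-32.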
The only thing you lose is direct indexing, but I think in practice that doesn't hurt much. For what it's worth, the approach I took here is exactly what Go does, and those guys have thought a lot about this (including being the ones to invent UTF-8 many moons ago).
> Could the count getter simply be split into size (in bytes) and length (in code-points)?
That was my first thought. I too think `size` pretty naturally sounds like "in bytes". But I think this is too likely to confuse people. I want the names to be really unambiguous.
Also, by making the names longer and a bit more awkward to use, it discourages people from thinking about the length of their string, which is good. It's rare that solving a problem with strings should require thinking about their length.
> ah, @munificent, you beat me to it! I have a branch almost done with almost everything you just committed.
Oh, crap! I'm so sorry. I was in the shower thinking about how I wanted to handle UTF-8 and I felt like everything clicked so I wanted to get it all implemented before I forgot. Now that I have more contributors (woo!), I need to think about coordinating more!
> If you haven't started the Range subscripting yet, I volunteer to take that on.
Yes please!
Speaking of Go, Go also allows strings to be used as arbitrary byte buffers. That means they can contain any byte value, including zero and malformed UTF-8 sequences. That seems pretty useful to me. Now that strings internally store their length (thanks, @edsrzf!) we don't need them to be null-terminated. Something to consider.
from wren.
I considered removing the null terminator from strings, but decided against it since keeping it makes C interoperability easier.
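A minimal sketch of how the two ideas can coexist: the string stores its byte length explicitly, so it can hold zero bytes, and it still keeps a trailing terminator so the buffer can be handed to C functions. `LenString` and `len_string_new` are hypothetical names, not Wren's actual string object.

```c
#include <stdlib.h>
#include <string.h>

// Hypothetical length-prefixed string. `length` counts the payload bytes
// and does not include the trailing terminator.
typedef struct {
  size_t length;
  char value[];  // Flexible array member: length + 1 bytes are allocated.
} LenString;

// Copies `length` bytes (which may include zero bytes) and appends a
// terminator for C interoperability. Returns NULL if allocation fails.
// The caller releases the result with free().
LenString* len_string_new(const char* bytes, size_t length) {
  LenString* string = malloc(sizeof(LenString) + length + 1);
  if (string == NULL) return NULL;

  string->length = length;
  memcpy(string->value, bytes, length);
  string->value[length] = '\0';
  return string;
}
```

The trade-off is that C functions such as `strlen` stop at the first embedded zero byte, so code that cares about the whole buffer has to use the stored `length` instead.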
from wren.
One step closer! String subscripting with ranges works again and is UTF-8 savvy. (To slice a range of raw bytes from a string, we'll add range support to the subscript operator on `string.bytes`.)
The last piece is fixing the length/count getters.
from wren.
OK, I think I have the count methods and the overall API figured out. See: fe14364
from wren.