unicode-org / icu4x
Solving i18n for client-side and resource-constrained environments.
Home Page: https://icu4x.unicode.org
License: Other
@hagbard wrote a doc describing some constraints that WebAssembly brings that make it a non-viable option for porting ICU4X to the Web Platform. We should:
There are comments on the Data Pipeline doc, since it has already been merged. I would like to hear feedback to first determine whether they merit changes.
definition of "data version" - This definition describes the version of data as something that is "abstracted away from" the versions of the format and schema. I would think that the relationship is actually dependent on the schema, but still independent of the format. Since schema is the structure of the data, if the schema version changes, I would expect it to force the data version to change.
Data version - To the extent that this matters or makes sense, it would be more readable if the keys delineated "key segments" differently from multi-word segments. `CLDR_37_alpha1` and `FOO_1_1` are parsed differently, whereas `CLDR-37-alpha1` and `FOO-1_1` would be unambiguous.
Schema version / Data version - If we allow the data provider to choose which version(s)' worth of data to hold, then it's possible for a user to call data for a key+version which is not supported (maybe the version is too old/new, or the key has changed due to schema change). Do we have a description of how we handle that? We could just make it easy and return null / throw error. I suppose a data provider can be configured to fetch from an authoritative service with all versions of all data (depicted in the diagram?), which makes it a data provider decision/configuration.
About https://github.com/unicode-org/omnicu/blob/master/docs/ecosystem.md
At the moment it seems that only one implementation per `icu::*` component is listed. Any interest in listing other implementations, or just candidates for ICU4X inclusion?
Being compatible with `#![no_std]` is important for running on low-resource devices. The benefits of no_std include that `-Z` flags can be used to compile standard library code.
In a no_std environment, we would still depend on the `alloc` crate. A lightweight allocator such as wee_alloc can be used when necessary.
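As an illustrative sketch (not runnable on its own, since it assumes a crate depending on the external wee_alloc crate; the exact attribute set would depend on the target), a no_std library that still uses `alloc` might look like:

```rust
// Cargo.toml (sketch):
//   [dependencies]
//   wee_alloc = "0.4"

#![no_std]

// We give up `std`, but keep heap allocation via the `alloc` crate.
extern crate alloc;
use alloc::string::String;

// Swap in the lightweight wee_alloc allocator as the global allocator.
#[global_allocator]
static ALLOC: wee_alloc::WeeAlloc = wee_alloc::WeeAlloc::INIT;

pub fn hello() -> String {
    String::from("no_std, but still allocating")
}
```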
IDNA is in UTS 46, but it is a very specific subject area. The existing IDNA Rust crate lives alongside other URL-related functionality.
ICU's IDNA has had bugs piling up, with no clear owner, since IDNA is largely tangential to the core ICU functionality of localized text processing. UAX 31 and UTS 39 (identifiers and confusables) are largely in the same bucket.
I personally would like to see us focus on ECMA-402, and declare UAX 31, UTS 39, and UTS 46 out of scope of ICU4X, at least for the time being.
In order to reduce code size when shipping binaries to other environments, such as WebAssembly, it may be desirable to give a way to leverage the environment's standard library instead of building the Rust version into the OmnICU binary.
For example, JavaScript has a Map type. If we build OmnICU to target WebAssembly, it would be nice if OmnICU could import JavaScript Map instead of shipping hashbrown.
CC @kripken
The "X" part of ICU4X is still not fully defined. WebAssembly has been tossed out as one potential solution (main issue: #73). However, we should also investigate other approaches. One of those approaches could be treating Rust as a source language for transpilation.
Rust has certain advantages to serve as a source language for transpilation:
Obviously, the big hurdle is that we would have to figure out how to write this Rust transpiler, which, as far as I can tell, does not exist yet. For both Lisp and Rust, since these are uncommon languages at Google, it is uncertain whether we could find another team who would have the expertise to own the project.
However, given the strong community around Rust, I have a bit of hope that if we (at Google and Mozilla, two industry leaders) were to write such a tool, it would obtain community adoption and be used in other projects.
What should be the directory structure for our monorepo project in Git?
We've agreed to start out with a monorepo. However, the question remains about whether we ship artifacts as one large crate or many small crates.
Would it be possible to foresee a common key space for data, of which ICU4X data uses only a small part? That would allow multiplexing data providers, something we'd like to explore in Fuchsia. This is more of an inquiry as to whether something like this would be of interest for ICU4X rather than a requirement.
@hagbard has been drafting a style guide for OmnICU code in Rust. We should clean up the style guide and check it in to the docs/ folder as a Markdown file.
Spinning this off from #43.
Rust provides a number of commands that help universally work with each component. @nciric suggested dropping the bare list of those commands from README.md, which I did, but I think it could be nice to have them documented.
Those commands are:
cargo test
cargo bench
cargo doc --open
cargo build --release
cargo fmt
cargo clippy
Based on pull request #28, I would like to discuss ways we can deal with data providers and the end-client API (referred to in the document as the ergonomic API).
I feel that the average developer shouldn't care where the data comes from, but should be aware of the async nature of the request, as long as the project as a whole can set it up for them. Think of Chrome, where the Browser/Renderer processes set up data to be fetched from disk, or from a service if it is missing. An ordinary developer wouldn't need to make that decision at every point of interaction with our API.
A similar approach to what @zbraniecki proposed for caching can be applied to data providers. We can have a simple DataProviderCache object that's globally available to all constructors/methods. I don't expect that a single instance of our library will have more than a handful of different providers (if that), so the cache would be fairly small.
An example of DataProviderCache initialization:
data_provider_cache = DataProviderCache()
data_provider_cache.insert('static_data', static_provider[, preference_level_0])
data_provider_cache.insert('aws_data', aws_provider[, preference_level_1])
data_provider_cache.insert('slow_data', slow_provider[, preference_level_2])
...
Preference level was added in case two providers can supply the same data set, but potentially at a higher cost in speed, dollar amount, etc.
Each data provider would know which locales it can handle, and what data it can provide for each. It would also be able to tell whether it already has that data, so a new fetch is not necessary.
Our ergonomic API in that case would take the shape of:
Intl.NumberFormat(locale, options)
or if we want to enable developers to enforce specific data sources:
Intl.NumberFormat(locale, options, ['static_data', 'aws_data'])
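To make the shape concrete, here is a minimal Rust sketch of preference-ordered provider multiplexing. All names (`DataProviderCache`, `DataProvider`, the keys) are invented for illustration and are not an ICU4X API:

```rust
use std::collections::BTreeMap;

// Hypothetical provider trait; not the ICU4X API.
trait DataProvider {
    fn load(&self, key: &str) -> Option<String>;
}

struct StaticProvider;
impl DataProvider for StaticProvider {
    fn load(&self, key: &str) -> Option<String> {
        // Only knows one data key.
        if key == "decimal/symbols" { Some("static".into()) } else { None }
    }
}

struct SlowProvider;
impl DataProvider for SlowProvider {
    fn load(&self, _key: &str) -> Option<String> { Some("slow".into()) }
}

// Providers are tried in ascending preference-level order;
// BTreeMap iteration gives us that order for free.
struct DataProviderCache {
    providers: BTreeMap<u8, (&'static str, Box<dyn DataProvider>)>,
}

impl DataProviderCache {
    fn new() -> Self { Self { providers: BTreeMap::new() } }
    fn insert(&mut self, name: &'static str, p: Box<dyn DataProvider>, level: u8) {
        self.providers.insert(level, (name, p));
    }
    fn load(&self, key: &str) -> Option<String> {
        // First provider (by preference) that can supply the key wins.
        self.providers.values().find_map(|(_, p)| p.load(key))
    }
}

fn main() {
    let mut cache = DataProviderCache::new();
    cache.insert("slow_data", Box::new(SlowProvider), 2);
    cache.insert("static_data", Box::new(StaticProvider), 0);
    // The cheaper provider wins when it can supply the key:
    assert_eq!(cache.load("decimal/symbols").unwrap(), "static");
    // Otherwise we fall through to the next preference level:
    assert_eq!(cache.load("unit/length").unwrap(), "slow");
}
```

A real implementation would also need to handle two providers at the same preference level and asynchronous fetching, which this sketch ignores.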
We want to make OmnICU able to be ported to other environments where strings are not necessarily UTF-8. For example, if called via FFI from Dart or WebAssembly, we may want to process strings that are UTF-16. How should we deal with strings in OmnICU that makes it easy to target UTF-16 environments while not hurting performance in UTF-8 environments?
One of the features of unic-locale is that it allows for "free" encoding of language identifiers, locales, and subtags thanks to proc macros.
In the current model it's quite quirky, and I didn't want to include it in the initial landing, but for that code to work, we only need to add two methods per subtag, and at least one of them doesn't have to be public - they just need to make it possible to create a subtag from a pre-computed u64/u32 (which is what the proc macro will do).
The good news is that Rust is about to stabilize proc macros (targeting Rust 1.45) which will allow us to get this feature without multi-crate hacks required before.
This issue is a companion of unicode-org/rust-discuss#14.
How interested are we in general in having a common set of trait-specified API surfaces to program against?
The idea is to allow multiple implementations, say omnicu (this work) and based on icu4c (like https://github.com/google/rust_icu). The theory is great, but practice is proving to be a bit more difficult than I thought.
Trying my hand at this, I found that Rust puts constraints on how the traits can look, specifically because returning some useful forms of iterators is involved.
Here's my attempt to work this out for https://github.com/zbraniecki/unic-locale/tree/master/unic-langid. For example, having a trait method that returns an iterator trait (e.g., `ExactSizeIterator`) seems quite complicated because of the need to arrange the correct lifetimes of the iterator, the iterated-over objects themselves, and the owner of the iterated-over objects. I got to this, but I'm not very happy about the outcome: https://github.com/unicode-org/rust-discuss/pull/19/files
FYI @zbraniecki
There are a few conventions for how to name a locale variable. The two I've seen the most are:
locale
loc
Which convention should we adopt in ICU4X, e.g., in argument names?
I've seen both kebab case and snake case used in crate names.
Which convention do we want to adopt?
icu-locale
icu_locale
The discussion in #40 regarding implementation touches on performance of existing ICU4C code vs. the pre-existing Rust module unicode-normalization.
Performance results enable comparison between implementations, which enables a decision on future implementation strategy. Performance results can also be useful for their own sake as a measure of Rust code across changes, independent of Rust vs. C comparisons.
Some of the aspects of performance testing:
`cargo bench` provides a way of running benchmarks on test code.

This is a follow-up to #43 (comment).
There are some compelling arguments that we should expose only Locale and not LanguageIdentifier as public API in ICU4X. This issue is a reminder to revisit this discussion once the rest of unic-locale is rolled in and we are able to perform more testing.
ICU4C/ICU4J is built on top of a large, internal standard library of low-level data structures. We will likely want at least a subset of those in OmnICU.
My question is: which ones do we need in order to support the ECMA-402 feature set?
Examples:
CC @markusicu
How do we name our crates? This depends a bit on the answer to #13, but we have two general options:
@zbraniecki had some arguments for preferring "icu". Can you lay those out here?
Filing this so I don't forget.
Since README.md is, at least for the time being, the main landing page for this project, consider adding links to files in the docs/ dir so that they are easily accessible.
I find having continuous code coverage to be very useful in finding missing spots in test coverage.
I have only ever used coveralls, but it works quite well. Here's an example for fluent-rs.
I think we should set up some CI + codecov, either via GitHub Actions, Travis, or another solution. tarpaulin, which is a crate I'm using, also supports codecov.
When we don't want to perform file or network I/O, it may be necessary to build the data into a binary format that can be built into the Rust static memory space.
We can use a solution like CBOR or bincode.
However, these solutions require parsing the data structure into Rust objects at runtime. That should be pretty quick, and string buffers could point into the static memory, but a pure Rust static structure would likely be the fastest. That said, I have not yet done code size or performance testing.
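To illustrate the trade-off, here is a toy sketch contrasting the two approaches. The struct, the field names, and the trivial `|`-separated blob format are all invented for illustration; a real pipeline would use CBOR or bincode for the blob:

```rust
// Hypothetical data struct for illustration.
struct DecimalSymbols {
    decimal_separator: &'static str,
    grouping_separator: &'static str,
}

// Option A: a pure Rust static structure -- zero parsing at runtime.
static SYMBOLS_EN: DecimalSymbols = DecimalSymbols {
    decimal_separator: ".",
    grouping_separator: ",",
};

// Option B: bytes baked into the binary (e.g. via include_bytes!),
// deserialized into Rust objects at startup. The strings can still
// point into static memory, so only the struct itself is built at runtime.
static SYMBOLS_BLOB: &[u8] = b".|,";

fn parse_symbols(blob: &'static [u8]) -> DecimalSymbols {
    let s = core::str::from_utf8(blob).unwrap();
    let mut parts = s.split('|');
    DecimalSymbols {
        decimal_separator: parts.next().unwrap(),
        grouping_separator: parts.next().unwrap(),
    }
}

fn main() {
    assert_eq!(SYMBOLS_EN.decimal_separator, ".");
    let parsed = parse_symbols(SYMBOLS_BLOB);
    assert_eq!(parsed.grouping_separator, ",");
}
```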
The Bylaws state that official decisions for design docs should be made at meetings with consensus.
In order to better stick to that process, we can have a Github PR bot that checks if reviewers from at least 2 member companies have reviewed the PR. We can reuse or extend a similar Github PR bot used for ICU PRs.
In data-pipeline.md, I explain the need for a data provider trait, with an open question being exactly what the definition of this trait should be. The purpose of this thread is to reach a cohesive solution for that trait.
Requirements list (evolving):
One part of the larger testing strategy for ICU4X would be to have a clean, consistent way of organizing the unit tests for the business logic (i18n algorithms). In particular, it would be nice to have a data-oriented style of testing, as exemplified already in some parts of @zbraniecki's unic-locale repo. ICU unit tests tend to be written in a parameterized style, but the idea here is to take the data-driven nature further.
Pros:
Cons:
Most searches for "data driven testing" produce results for databases, spreadsheets, and automated web UI testing. Links to more relevant pre-existing libraries are welcome.
Some examples of test libraries written to reduce the cognitive load when testing, especially when testing data collections:
Beyond just asserting that the actual return value matches the provided expected value, we should also consider the following testing aspects:
One of the questions we had from clients was how hard it is to use the library through FFI.
I played around with C++, Rust & cbindgen, based on the Rust FFI Omnibus and other sites.
Here are the results in my experimental repo; the files of interest are main.cc and lib.rs.
I could document the findings so far if people feel it's useful. We can expand it over time.
RACI is described here: https://en.wikipedia.org/wiki/Responsibility_assignment_matrix
tl;dr: this allows a distinction between people who are accountable for the outcome, and people who actually do the work (can be the same as accountable, but not necessary); as well as calling out explicitly the folks who can provide info vs people who are informed only.
https://github.com/unicode-org/icu4x/blob/master/docs/triaging.md#assignee talks about a "champion". This would be "accountable". Using prior art in responsibility assignment allows us not to spend time reinventing it.
Complementary to #1.
In case it is useful, we could consider using an IDL to auto-generate libraries from a common core implementation.
Pros:
Cons:
@nciric has a Google Doc with a table laying out some differences between these two approaches for porting OmnICU to other programming languages. The doc should be migrated to Markdown in this repository.
In #43 I was able to improve the parsing performance by ~24% by separating length measuring and returning `Err` before parsing the subtag into a `TinyStr`.
The reason this works so well is that in multiple places we use parsing to test whether a subtag is of a given type, but in most cases we can learn that just from the length of the subtag.
A potential optimization may be to either separate length measuring in the parser (so: take the length, check whether it's `2..=4`, and if not, don't even try to parse as X) or to add an internal constructor that takes a length (`from_bytes_with_length`) so that the length of the subtag is measured only once.
Depending on the cost of constructing the `Result::Err` and the cost of taking the `len` of a subtag (we do this per `from_bytes`), we may get additional wins there.
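A minimal sketch of the length pre-check idea. The real subtag rules are more involved, and `is_region_candidate`/`parse_region` are invented helpers, not the unic-langid API:

```rust
// Region subtags are 2 letters or 3 digits (simplified here to a pure
// length check). Checking the length first rejects most non-region
// subtags without attempting a full parse or constructing an Err payload.
fn is_region_candidate(subtag: &[u8]) -> bool {
    matches!(subtag.len(), 2 | 3)
}

fn parse_region(subtag: &[u8]) -> Result<String, ()> {
    if !is_region_candidate(subtag) {
        return Err(()); // cheap early exit, no TinyStr construction
    }
    // ... full validation would happen here; we just canonicalize case ...
    core::str::from_utf8(subtag)
        .map(|s| s.to_ascii_uppercase())
        .map_err(|_| ())
}

fn main() {
    assert!(parse_region(b"US").is_ok());
    assert!(parse_region(b"toolong").is_err());
}
```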
One approach to accomplish the main goals in the OmnICU Charter is to have a transpiler that converts input source code into the equivalent source code for each target language (or platform).
The intent of this approach is to decouple the code deliverable from the target language toolchain and allow the toolchain to optimize code on a per-application basis. The result would allow the target code to run with minimal dependencies and code size.
The input source code should represent the i18n functionality that is in the scope for OmnICU. It should also be able to pass unit tests that check the logical correctness of the input code itself, and do so before any transpilation occurs. This ensures an authoritative source of truth for logical correctness tests, decoupled from target language/runtime effects.
My current WIP for the data provider (#61) passes around abstract hunks of data (`Any` in Rust), which can be cast to specific structs using dynamic type checking at runtime.
let request = datap::Request {
locale: "root".to_string(),
category: datap::Category::Decimal,
key: datap::Key::Decimal(Key::SymbolsV1),
payload: None,
};
let response = my_data_provider.load(request).unwrap();
let decimal_data = response.borrow_payload::<&SymbolsV1>().unwrap();
My question is: where does the struct definition `SymbolsV1` live in code? It could live:
I'm doing Option 2 in #61. I noticed Elango is doing Option 1 in #86. Because of the extensibility argument, I have a slight preference for Option 1, but I fear it could get unwieldy for data provider implementations.
We should think about how we test different feature sets and architectures. By default, `cargo test` only tests your default architecture and the crate's default features.
Examples of things we want to test:
`std` vs. `no_std` environment (by enabling or disabling the `std` feature)

Note: rust-lang/cargo#2911 is a feature request to allow integration tests to choose different feature sets.
The CLDR collation order for Japanese includes a chunk of data that's just Level 1 Kanji followed by Level 2 Kanji from JIS X 0208. (Starts with `&[last regular]<*亜`.)
CLDR includes alternative collation orders for Chinese, `gb2312han` (starts with `&[last regular]<*啊`) and `big5han` (starts with `&[last regular]<*兙`), that appear to be Level 1 Hanzi followed by Level 2 Hanzi from GB2312 and Big5, respectively.
We might want to consider whether it makes sense as a binary size optimization (depending on what data layout the collator needs and whether the `gb2312han` and `big5han` orders are in actual use) to provide a cargo option on the ICU4X collator and a cargo option on encoding_rs to use the data that already exists in the data segment of apps depending on encoding_rs for constructing (the relevant parts of) these collation orders.
(This is just "writing this down". I don't expect us to act on this anytime soon.)
Our data loading involves reading files on demand, and sometimes even requesting data from a REST provider over HTTP.
Is the plan to make all the APIs asynchronous? Or to have two versions of each method?
CC: @sffc @zbraniecki
The way I run CI for my projects is via Travis.
I have an account there, connected my project to it, and set up a `.travis.yml` file.
Here's the view of the fluent-rs CI: https://travis-ci.org/github/projectfluent/fluent-rs
I'm not sure what other options exist beyond Travis, or whether we want to use it. @echeran?
@zbraniecki has graciously offered to migrate https://github.com/zbraniecki/unic-locale to this repository and make it the core locale type for ICU4X. This issue is for tracking the progress of this migration.
I volunteered as Chair; @zbraniecki and @nciric volunteered as Vice-Chairs. Document this.
I very often see clients who want to use ICU as a default behavior, but fall back to custom logic if ICU does not support a given locale.
The main problem, of course, is that the locale fallback chain is an essential piece of determining whether or not a locale is supported. If you have locale data for "en" and "en_001" but request "en_US" or "en_GB", the answer is that both of those locales are supported, even though they both load their data from a fallback locale.
I'm not 100% confident, but I think the prevailing use case is that programmers want to know whether the locale falls back all the way to root. If it gets "caught" by an intermediate language, then that's fine, as long as we don't use the stub data in root.
ECMA-402 has the concept of supportedLocalesOf. Although it's not clear on MDN, it appears that this method has the ability to check for locale fallbacks. This is better than ICU4C's getAvailableLocales behavior, which returns a string list and requires the user to figure out how to do fallbacks and matching on that list.
We could consider whether this use case fits in with the data provider, or whether we want to put it on APIs directly.
The unic-langid code @zbraniecki is importing supports multiple variant subtags, as is required by the BCP47 standard. However, I had previously found that the requirement to store and sort a variable-length list of subtags doubles the code size of unic-langid (zbraniecki/unic-locale#49).
Claim: Most language tags don't have variant subtags, and of the ones that do, they usually only have one. I don't have data to back up this claim.
Given that (1) unic-langid is very low-level and (2) multiple variant subtags are uncommon (a claim which could be refuted by evidence), I was thinking about changing the data model:
Instead of a `Vec` to store the variant subtags, store them as a `TinyStrAuto` or `TinyStr16`: `"variant1-variant2-variant3"`, in alphabetical order.
We could still have helper methods to split out the different variants, but the data model would be significantly lighter weight, especially if we use TinyStr16 and reject language codes with too many variants.
To be clear, this would only need to affect LanguageIdentifier, not Locale, since Locale needs to carry the extra machinery in order to handle Unicode extension subtags.
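A toy sketch of the proposed model: variants stored as one alphabetically sorted, hyphen-joined string instead of a `Vec`. The names are illustrative, and `String` stands in for the `TinyStr` type the real code would use:

```rust
// Packed representation: None for the common no-variants case,
// otherwise "variant1-variant2-variant3" sorted alphabetically.
struct Variants(Option<String>);

impl Variants {
    fn from_subtags(mut subtags: Vec<&str>) -> Self {
        if subtags.is_empty() {
            return Variants(None);
        }
        // Sorting at construction time keeps comparison and
        // canonicalization trivial later.
        subtags.sort_unstable();
        Variants(Some(subtags.join("-")))
    }

    // Helper that splits the packed string back into individual variants.
    fn iter(&self) -> impl Iterator<Item = &str> {
        self.0
            .as_deref()
            .unwrap_or("")
            .split('-')
            .filter(|s| !s.is_empty())
    }
}

fn main() {
    let v = Variants::from_subtags(vec!["valencia", "fonipa"]);
    assert_eq!(v.0.as_deref(), Some("fonipa-valencia"));
    assert_eq!(v.iter().count(), 2);
    assert_eq!(Variants::from_subtags(vec![]).iter().count(), 0);
}
```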
ecosystem.md mentions `icu::Regex`. The Rust `regex` crate already exists and is very performant (in part due to not supporting some Perl-popularized features that aren't actually regular and hinder performance).
It might be useful to signal intent in this area at some point.
Does the project seek to provide regular expressions that operate on UTF-8 for Rust apps? If so, what would be the elevator pitch relative to the `regex` crate?
Does the project seek to provide regular expressions that operate on UTF-16 and Latin1 and conform to ECMAScript regular expressions for use in JavaScript engines? If so, what would be the elevator pitch relative to what SpiderMonkey and V8 already have?
Does the project seek to provide regular expressions that Dart or Go programs would use? If so, what would be the elevator pitch relative to what the standard libraries of these languages provide?
Does the project seek to provide regular expressions that C or C++ apps would use via FFI? If so, would this just be FFI around the `regex` crate (i.e., UTF-8), something new, or something for UTF-16?
@markusicu has done a great deal of work on ICU4C's normalizer. It depends on low-level and highly optimized data structures such as UCPTrie.
Writing normalization code from a clean room would allow us to:
Current design:
https://github.com/unicode-org/omnicu/blob/master/docs/data-pipeline.md
Brainstorming doc (please comment):
https://docs.google.com/document/d/1s_DE6zH27yGNv7rcfZEL8K3Hd0F3eIwMEUmbr7qs3lM/edit#
I think that either I am completely misunderstanding what the intended use case for the design is, or it's worth rethinking some of it. Please comment on the doc.
The Locale component is currently owned by me, and I get a lot of Rust-specific help from Manish, who has been my reviewer for this codebase.
Per today's conversation, it would be good to have a second person to co-own the component with me from the implementation/API etc. point of view.
I don't think it's blocking or urgent, but I'd like to file this issue so that we make sure to get the proper ownership coverage per component as we move.
The charter currently says:
OmnICU will provide an ECMA-402-compatible API surface in the target client-side platforms
and:
What if clients need a feature that is not in ECMA-402?
Clients of OmnICU may need features beyond those recommended by ECMA-402. The subcommittee is not ruling out the option of adding additional features in the same style as ECMA-402 to cover additional client needs. The details for how to determine what features belong in OmnICU that aren't already in ECMA-402 will be discussed at a future time.
The caption in ecosystem.md says:
This document tracks the crates that already exist in the ecosystem that cover functionality that we may wish to cover in OmnICU.
I added the word "may" in #41.
I think it's important that we be more explicit about the use cases that ICU4X is supporting. This will guide our discussions, such as #43 (comment), when deciding whether a certain API or functional unit belongs in ICU4X.
We could start by making an explicit list of use cases that warrant APIs and functional units not covered by 402, and adding that list to the charter. It might be best to do this on a case-by-case basis: if proposing a feature not explicitly sanctioned by the charter, then propose a change to the charter adding the corresponding use case to the charter, such that we can agree on that change in the subcommittee meeting.
At the moment, `PartialEq<&str>` relies on `to_string` of the `LanguageIdentifier`, but as we can see from benchmarks, parsing is much faster than serializing.
Therefore it might be nice to do the reverse: start parsing the `&str` and compare subtags as we go.
The nice thing about this is that if we encounter mismatching subtags early on, we can stop parsing.
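A simplified sketch of the parse-and-compare idea. `LangId` is a stand-in for `LanguageIdentifier` (language + optional region only), and `eq_str` is an invented method name, not the actual `PartialEq` impl:

```rust
// Simplified stand-in for LanguageIdentifier.
struct LangId {
    language: String,
    region: Option<String>,
}

impl LangId {
    // Walk the &str's subtags and compare as we go, instead of
    // serializing self with to_string() and comparing strings.
    fn eq_str(&self, s: &str) -> bool {
        let mut parts = s.split('-');
        // Compare the language subtag first; bail out on mismatch
        // without touching the rest of the string.
        match parts.next() {
            Some(lang) if lang.eq_ignore_ascii_case(&self.language) => {}
            _ => return false,
        }
        match (&self.region, parts.next()) {
            (Some(r), Some(p)) if p.eq_ignore_ascii_case(r) => {}
            (None, None) => {}
            _ => return false,
        }
        // Any trailing subtags mean the strings are not equal.
        parts.next().is_none()
    }
}

fn main() {
    let en_us = LangId { language: "en".into(), region: Some("US".into()) };
    assert!(en_us.eq_str("en-US"));
    assert!(en_us.eq_str("EN-us")); // case-insensitive, like BCP 47
    assert!(!en_us.eq_str("en")); // missing region: early exit
    assert!(!en_us.eq_str("de-DE")); // language mismatch: earliest exit
}
```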
Things we should cover in a linting CI:
cargo fmt