Comments (16)
I'm not opposed to that, but I'd prefer to try other ways to mitigate the size increase:
- dummy sort as suggested by Manish
- Optional vec, or smallvec
I see such an optimization as a last resort, as it moves away from a nice semantic API toward a compromise.
Is that ok with you?
from icu4x.
Even simpler would be to allow only a single variant subtag in LanguageIdentifier, and fail to parse when there is more than one. That would make the LanguageIdentifier data model dead simple. If users want more variant subtags, use the heavier Locale class.
I'm not opposed to looking at other solutions, but what are the use cases for multiple variant subtags that we care about supporting? Right now we are compromising code size and complexity for correctness. Who actually cares about that correctness, needs to use LanguageIdentifier, and can't use Locale?
Using an optional vec doesn't solve the problem, because as long as a vec could be used, we still need to carry the extra code.
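To make the single-variant idea concrete, here is a rough sketch of what such a data model could look like. Every name in it (`Subtag`, `SingleVariantLangId`, `ParseError`, `parse`) is hypothetical and not taken from icu4x or unic-langid, and the parser is deliberately simplified: it accepts only `-` as a separator and does no case normalization.

```rust
/// Hypothetical fixed-size subtag: at most 8 ASCII bytes, stored inline.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Subtag {
    bytes: [u8; 8],
    len: u8,
}

impl Subtag {
    /// Callers must validate the subtag (ASCII, at most 8 bytes) first.
    fn new(s: &str) -> Subtag {
        let mut bytes = [0u8; 8];
        bytes[..s.len()].copy_from_slice(s.as_bytes());
        Subtag { bytes, len: s.len() as u8 }
    }

    fn as_str(&self) -> &str {
        // Only validated ASCII is ever stored, so the fallback is unreachable.
        std::str::from_utf8(&self.bytes[..self.len as usize]).unwrap_or("")
    }
}

/// Hypothetical "plain old data" identifier: `Copy`, no `Vec`, one variant at most.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct SingleVariantLangId {
    language: Subtag,
    script: Option<Subtag>,
    region: Option<Subtag>,
    variant: Option<Subtag>,
}

#[derive(Debug, PartialEq, Eq)]
enum ParseError {
    InvalidSubtag,
    MultipleVariants,
}

fn all_alpha(s: &str) -> bool {
    !s.is_empty() && s.bytes().all(|b| b.is_ascii_alphabetic())
}

fn is_language(s: &str) -> bool {
    all_alpha(s) && matches!(s.len(), 2 | 3 | 5..=8)
}

fn is_script(s: &str) -> bool {
    all_alpha(s) && s.len() == 4
}

fn is_region(s: &str) -> bool {
    (all_alpha(s) && s.len() == 2) || (s.len() == 3 && s.bytes().all(|b| b.is_ascii_digit()))
}

fn is_variant(s: &str) -> bool {
    let b = s.as_bytes();
    b.iter().all(|c| c.is_ascii_alphanumeric())
        && ((5..=8).contains(&b.len()) || (b.len() == 4 && b[0].is_ascii_digit()))
}

/// Parses language[-script][-region][-variant]; a second variant is an error.
fn parse(tag: &str) -> Result<SingleVariantLangId, ParseError> {
    let mut parts = tag.split('-').peekable();
    let language = match parts.next() {
        Some(s) if is_language(s) => Subtag::new(s),
        _ => return Err(ParseError::InvalidSubtag),
    };
    let script = parts.next_if(|s| is_script(s)).map(Subtag::new);
    let region = parts.next_if(|s| is_region(s)).map(Subtag::new);
    let variant = match parts.next() {
        None => None,
        Some(s) if is_variant(s) => Some(Subtag::new(s)),
        Some(_) => return Err(ParseError::InvalidSubtag),
    };
    match parts.next() {
        None => Ok(SingleVariantLangId { language, script, region, variant }),
        Some(s) if is_variant(s) => Err(ParseError::MultipleVariants),
        Some(_) => Err(ParseError::InvalidSubtag),
    }
}
```

With this sketch, `parse("sl-IT-rozaj")` succeeds, while `parse("sl-IT-rozaj-biske-1994")` returns `Err(ParseError::MultipleVariants)`; callers who need the latter would have to go through the heavier type.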
from icu4x.
Is lossy optimization needed at this point? What benchmarks are we trying to beat?
from icu4x.
> Is lossy optimization needed at this point?
What I'm trying to say is that "lossy optimization" is only lossy if there are use cases that break.
For example, instead of thinking about this as removing a feature, think about it in the other direction, as adding a feature: say we supported language-script-region with a single variant, and we wanted to add support for multiple variants. We would need to make the case that the use cases for that feature justify the increased code complexity.
> What benchmarks are we trying to beat?
That's a good question. I don't have a perfect answer, other than the stats I previously posted in zbraniecki/unic-locale#49 suggesting that multiple variants doubled code size for LanguageIdentifier.
I also see it as "common sense": I can't think of any language tags I've seen in ICU and elsewhere that have multiple variants. Why add a feature that no one is going to use, whether or not it increases code complexity?
from icu4x.
@sffc I think I see it as conformance to the spec. If we don't think variants should be a list, we should work with Unicode to update the spec to allow for LanguageIdentifier with just a single variant allowed to be a thing.
Implementing it here as a "Language Identifier but different" would be a last resort I think, if we run into actual trouble with code size.
My hope is that the Vec is not actually what causes the size increase, and it's the sorting/dedup, which we can hopefully shrink without having to degrade our standard support.
from icu4x.
I think we should first explore having our own sort function
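For illustration, a hand-rolled sort could be as small as an insertion sort plus a manual dedup pass. This is only a sketch: `sort_and_dedup` is a hypothetical helper, not existing icu4x or unic-langid code, and whether it actually shrinks the binary compared to the standard library routines would still need to be measured.

```rust
/// Hypothetical "dumb sort": insertion sort, then drop adjacent duplicates.
/// Variant lists are tiny, so the quadratic worst case should not matter.
fn sort_and_dedup<T: Ord>(items: &mut Vec<T>) {
    // Insertion sort: move each element left until its predecessor is not larger.
    for i in 1..items.len() {
        let mut j = i;
        while j > 0 && items[j - 1] > items[j] {
            items.swap(j - 1, j);
            j -= 1;
        }
    }

    // Manual dedup over the sorted data: keep an element only if it differs
    // from the last kept one, compacting kept elements to the front.
    let mut kept = 0;
    for i in 0..items.len() {
        if kept == 0 || items[kept - 1] != items[i] {
            items.swap(kept, i);
            kept += 1;
        }
    }
    items.truncate(kept);
}
```

Calling it on `vec!["scouse", "fonipa", "scouse"]` leaves `["fonipa", "scouse"]`.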
from icu4x.
> @sffc I think I see it as conformance to the spec. If we don't think variants should be a list, we should work with Unicode to update the spec to allow for LanguageIdentifier with just a single variant allowed to be a thing.
@macchiati Thoughts on this? Why does CLDR allow a variable number of variant subtags?
There's an elegance to having LanguageIdentifier be "plain old data", with no chance of it ever needing to heap-allocate. My instinct is that we shouldn't let an arcane part of the spec with no clear use cases complicate a low-level data structure like this. But, if we think spec compliance is more important, then so be it, and we can investigate other options.
> I think we should first explore having our own sort function
I'm more concerned about the fact that LanguageIdentifier can require a heap allocation than I am about which sorting algorithm we use.
from icu4x.
> There's an elegance to having LanguageIdentifier be "plain old data", with no chance of it ever needing to heap-allocate.
My experience working with unic-langid is that it works quite well: you don't heap-allocate unless you use Variants. So for all the locales we deal with during Firefox startup, we never heap-allocate for any LanguageIdentifier, because we don't need to.
If you want to make sure your code doesn't, you could maybe wrap LanguageIdentifier in a struct like LanguageIdentifierNonAllocating which wouldn't allow for multiple variants?
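A minimal sketch of such a wrapper, assuming a newtype that checks the variant count once at construction. The `LanguageIdentifier` below is a simplified stand-in defined only for this example, not the real unic-langid type, and `TooManyVariants`, `new` and `variant` are made-up names. The stand-in uses `String` and `Vec` for brevity, so it illustrates only the invariant check at the boundary, not the allocation behaviour itself.

```rust
/// Simplified stand-in for illustration; NOT the unic-langid definition.
#[derive(Debug, Clone, PartialEq)]
struct LanguageIdentifier {
    language: String,
    variants: Vec<String>,
}

/// Hypothetical error returned when more than one variant is present.
#[derive(Debug, PartialEq)]
struct TooManyVariants {
    found: usize,
}

/// Newtype whose constructor rejects identifiers with multiple variants.
#[derive(Debug, Clone, PartialEq)]
struct LanguageIdentifierNonAllocating(LanguageIdentifier);

impl LanguageIdentifierNonAllocating {
    fn new(id: LanguageIdentifier) -> Result<Self, TooManyVariants> {
        if id.variants.len() > 1 {
            Err(TooManyVariants { found: id.variants.len() })
        } else {
            Ok(Self(id))
        }
    }

    /// The single variant, if any.
    fn variant(&self) -> Option<&str> {
        self.0.variants.first().map(String::as_str)
    }
}
```

Code that must stay on the cheap path would accept only the wrapper type, so the check happens once where identifiers enter that code.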
For code size, I'm with @Manishearth and would prefer to start with just implementing dumb-sort.
from icu4x.
FYI
BCP 47 spec for Variant Subtags: https://www.rfc-editor.org/rfc/rfc5646.html#section-2.2.5
Further down in https://www.rfc-editor.org/rfc/rfc5646.html#section-4.1 Choice of Language Tag, guideline 6 says "Use variant subtags sparingly and in the correct order." which is not necessarily alphabetic order, while LDML says to sort variants in order to create a canonical form. It has two examples of language tags that are "correctly ordered" but are in reverse ASCII order.
@macchiati is this difference documented in the LDML spec? If not, what's the best way to address it?
Note that the second example in section 4.1 has three variant subtags: "sl-IT-rozaj-biske-1994". This is meaningful and real: see the entry for "Subtag: 1994" in https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
https://www.rfc-editor.org/rfc/rfc5646.html#section-4.4.1 Working with Limited Buffer Sizes looks relevant. It says "Protocols or specifications that specify limited buffer sizes for language tags MUST allow for language tags of at least 35 characters." which provides for two variant subtags.
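The 35-character figure can be reproduced from the per-subtag budget given in that section of the RFC (the script, region, and variant figures each include the leading hyphen). The constant and function names below are made up for this sketch:

```rust
// Per-subtag character budget from RFC 5646 section 4.4.1.
const LANGUAGE: usize = 8; // longest allowed registered language value
const SCRIPT: usize = 5; // hyphen + 4 letters
const REGION: usize = 4; // hyphen + 3-digit UN M.49 code
const VARIANT: usize = 9; // hyphen + up to 8 characters

// Language, script, region, and two variants.
const MIN_BUFFER: usize = LANGUAGE + SCRIPT + REGION + 2 * VARIANT;

/// Hypothetical helper: does a tag fit in the minimum required buffer?
fn fits_minimum_buffer(tag: &str) -> bool {
    tag.len() <= MIN_BUFFER
}
```

The budget adds up to 35, and the three-variant example "sl-IT-rozaj-biske-1994" is only 22 characters long, so it still fits because its subtags are shorter than the maximum.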
https://www.rfc-editor.org/rfc/rfc5646.html#appendix-A Examples of Language Tags includes "sl-rozaj-biske (San Giorgio dialect of Resian dialect of Slovenian)"
from icu4x.
http://www.unicode.org/reports/tr35/tr35.html#Canonical_Unicode_Locale_Identifiers "A unicode_locale_id has canonical syntax when: ... Any variants are in alphabetical order (eg, en-fonipa-scouse, not en-scouse-fonipa) ..."
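A small check makes the difference from the BCP 47 guideline visible. This sketch assumes "alphabetical order" means plain ASCII byte order on already-lowercased variants, and `is_canonical_variant_order` is a hypothetical helper name:

```rust
/// Returns true if the variants are strictly ascending in ASCII byte order,
/// i.e. sorted with no duplicates. Assumes lowercase ASCII input.
fn is_canonical_variant_order(variants: &[&str]) -> bool {
    variants.windows(2).all(|pair| pair[0] < pair[1])
}
```

Under that assumption, `["fonipa", "scouse"]` passes and `["scouse", "fonipa"]` does not. The variants of "sl-IT-rozaj-biske-1994", which RFC 5646 calls correctly ordered, also fail the check, since `["rozaj", "biske", "1994"]` is in reverse ASCII order.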
from icu4x.
unic-langid sorts variants using alphabetical order
from icu4x.
https://unicode-org.atlassian.net/browse/CLDR-13729 "Canonical Unicode Locale Identifiers variant order is in conflict with BCP 47 guideline"
from icu4x.
This question is moot if we don't expose LanguageIdentifier as a public type. I will close this issue as obsolete, and we can follow up in #64.
from icu4x.