
Comments (16)

zbraniecki avatar zbraniecki commented on September 15, 2024

I'm not opposed to that, but I'd prefer to try other ways to mitigate the size increase:

  • dummy sort as suggested by Manish
  • Optional vec, or smallvec

I see such an optimization as a last resort, as it moves away from a nice semantic API toward a compromise.
Is that ok with you?

from icu4x.

sffc avatar sffc commented on September 15, 2024

Even simpler would be to allow only a single variant subtag in LanguageIdentifier, and fail to parse when there is more than one. That would make the LanguageIdentifier data model dead simple. If users want more variant subtags, use the heavier Locale class.

I'm not opposed to looking at other solutions, but what are the use cases for multiple variant subtags that we care about supporting? Right now we are compromising code size and complexity for correctness. Who actually cares about that correctness, needs to use LanguageIdentifier, and can't use Locale?

Using an optional vec doesn't solve the problem, because as long as a vec could be used, we still need to carry the extra code.
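To make the single-variant idea concrete, here is a minimal sketch of a parser that stores every subtag inline and rejects a second variant. All names here (`SingleVariantLangId`, `Subtag`, `ParseError`, `parse`) are hypothetical and not part of unic-langid or ICU4X, and the subtag rules are simplified (no case normalization, no extensions):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ParseError {
    InvalidSubtag,
    MultipleVariants,
}

/// Hypothetical fixed-capacity ASCII subtag (at most 8 bytes), stored inline.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Subtag {
    bytes: [u8; 8],
    len: u8,
}

impl Subtag {
    fn new(s: &str) -> Result<Self, ParseError> {
        let b = s.as_bytes();
        if b.is_empty() || b.len() > 8 || !b.iter().all(|c| c.is_ascii_alphanumeric()) {
            return Err(ParseError::InvalidSubtag);
        }
        let mut bytes = [0u8; 8];
        bytes[..b.len()].copy_from_slice(b);
        Ok(Subtag { bytes, len: b.len() as u8 })
    }

    pub fn as_str(&self) -> &str {
        // Only ASCII alphanumerics are ever stored, so this cannot fail.
        core::str::from_utf8(&self.bytes[..self.len as usize]).unwrap()
    }
}

/// Hypothetical "plain old data" identifier: no Vec, so no heap allocation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SingleVariantLangId {
    pub language: Subtag,
    pub script: Option<Subtag>,
    pub region: Option<Subtag>,
    pub variant: Option<Subtag>,
}

fn all_alpha(s: &str) -> bool {
    s.bytes().all(|c| c.is_ascii_alphabetic())
}

fn all_digit(s: &str) -> bool {
    s.bytes().all(|c| c.is_ascii_digit())
}

// Simplified variant shape: 5-8 alphanumerics, or a digit followed by 3 alphanumerics.
fn is_variant(s: &str) -> bool {
    let b = s.as_bytes();
    (5..=8).contains(&b.len()) || (b.len() == 4 && b[0].is_ascii_digit())
}

pub fn parse(input: &str) -> Result<SingleVariantLangId, ParseError> {
    let mut parts = input.split('-').peekable();
    let lang = parts.next().ok_or(ParseError::InvalidSubtag)?;
    // Simplified language rule: 2-3 or 5-8 ASCII letters.
    if !all_alpha(lang) || !matches!(lang.len(), 2 | 3 | 5..=8) {
        return Err(ParseError::InvalidSubtag);
    }
    let mut id = SingleVariantLangId {
        language: Subtag::new(lang)?,
        script: None,
        region: None,
        variant: None,
    };
    if let Some(s) = parts.peek().copied() {
        if s.len() == 4 && all_alpha(s) {
            id.script = Some(Subtag::new(s)?);
            parts.next();
        }
    }
    if let Some(s) = parts.peek().copied() {
        if (s.len() == 2 && all_alpha(s)) || (s.len() == 3 && all_digit(s)) {
            id.region = Some(Subtag::new(s)?);
            parts.next();
        }
    }
    for s in parts {
        if !is_variant(s) {
            return Err(ParseError::InvalidSubtag);
        }
        if id.variant.is_some() {
            // The proposal: more than one variant is a parse failure.
            return Err(ParseError::MultipleVariants);
        }
        id.variant = Some(Subtag::new(s)?);
    }
    Ok(id)
}
```

With this sketch, `parse("sl-IT-rozaj")` succeeds, while `parse("sl-IT-rozaj-biske")` returns `Err(ParseError::MultipleVariants)`. The struct is `Copy`, which is one way to read "dead simple" here.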


kpozin avatar kpozin commented on September 15, 2024

Is lossy optimization needed at this point? What benchmarks are we trying to beat?


sffc avatar sffc commented on September 15, 2024

> Is lossy optimization needed at this point?

What I'm trying to say is that "lossy optimization" is only lossy if there are use cases that break.

For example, instead of thinking about this as removing a feature, think about it in the other direction, as adding a feature: say we supported language-script-region with a single variant, and we wanted to add support for multiple variants. We would need to make the case that the use cases for that feature justify the increased code complexity.

> What benchmarks are we trying to beat?

That's a good question. I don't have a perfect answer, other than the stats I previously posted in zbraniecki/unic-locale#49 suggesting that multiple variants doubled code size for LanguageIdentifier.

I also see it as "common sense": I can't think of any language tags I've seen in ICU and elsewhere that have multiple variants. Why add a feature that no one is going to use, whether or not it increases code complexity?


zbraniecki avatar zbraniecki commented on September 15, 2024

@sffc I think I see it as conformance to the spec. If we don't think variants should be a list, we should work with Unicode to update the spec so that a LanguageIdentifier with just a single variant is allowed.

Implementing it here as a "Language Identifier but different" would be a last resort I think, if we run into actual trouble with code size.
My hope is that the Vec is not actually what causes the size increase, and it's the sorting/dedup, which we can hopefully shrink without having to degrade our standard support.


Manishearth avatar Manishearth commented on September 15, 2024

I think we should first explore having our own sort function
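As a rough illustration of what "our own sort function" could look like, here is a minimal insertion sort plus dedup. The names `insertion_sort` and `sort_and_dedup` are hypothetical, and whether this actually produces a smaller binary than the standard library sort is an assumption that would need measuring:

```rust
/// Insertion sort: O(n^2) comparisons, but tiny, in-place, and allocation-free.
/// Variant lists are expected to be very short, so the quadratic cost is
/// assumed not to matter here.
pub fn insertion_sort<T: Ord>(items: &mut [T]) {
    for i in 1..items.len() {
        let mut j = i;
        // Walk the new element left until it is no longer smaller than its neighbor.
        while j > 0 && items[j - 1] > items[j] {
            items.swap(j - 1, j);
            j -= 1;
        }
    }
}

/// Sorts the list, then drops adjacent duplicates.
pub fn sort_and_dedup<T: Ord>(items: &mut Vec<T>) {
    insertion_sort(items);
    items.dedup();
}
```

For example, sorting and deduplicating `["scouse", "fonipa", "scouse"]` leaves `["fonipa", "scouse"]`.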


sffc avatar sffc commented on September 15, 2024

> @sffc I think I see it as conformance to the spec. If we don't think variants should be a list, we should work with Unicode to update the spec so that a LanguageIdentifier with just a single variant is allowed.

@macchiati Thoughts on this? Why does CLDR allow a variable number of variant subtags?

There's an elegance to having LanguageIdentifier be "plain old data", with no chance of it ever needing to heap-allocate. My instinct is that we shouldn't let an arcane part of the spec with no clear use cases complicate a low-level data structure like this. But, if we think spec compliance is more important, then so be it, and we can investigate other options.

> I think we should first explore having our own sort function

I'm more concerned about the fact that LanguageIdentifier can require a heap allocation than I am about which sorting algorithm we use.


macchiati avatar macchiati commented on September 15, 2024

from icu4x.

zbraniecki avatar zbraniecki commented on September 15, 2024

> There's an elegance to having LanguageIdentifier be "plain old data", with no chance of it ever needing to heap-allocate.

My experience working with unic-langid is that it works quite well: you don't heap-allocate unless you use variants. So for all the locales we deal with during Firefox startup, no LanguageIdentifier ever heap-allocates, because none of them needs to.

If you want to make sure your code doesn't, you could maybe wrap LanguageIdentifier in a struct like LanguageIdentifierNonAllocating which wouldn't allow for multiple variants?

For code size, I'm with @Manishearth and would prefer to start with just implementing dumb-sort.
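The wrapper idea could be sketched roughly as follows. The `LanguageIdentifier` below is a stand-in reduced to two fields, not the real unic-langid type, and `TooManyVariants` is a hypothetical error type. Note that with a Vec-backed stand-in even a single variant allocates, so this only illustrates the API-level restriction on the number of variants:

```rust
use std::convert::TryFrom;

/// Stand-in for the real type, reduced to the fields needed here (assumption).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LanguageIdentifier {
    pub language: &'static str,
    pub variants: Vec<&'static str>,
}

/// Hypothetical wrapper that can only hold identifiers with at most one variant.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LanguageIdentifierNonAllocating(LanguageIdentifier);

/// Hypothetical error carrying the number of variants that was rejected.
#[derive(Debug, PartialEq, Eq)]
pub struct TooManyVariants(pub usize);

impl TryFrom<LanguageIdentifier> for LanguageIdentifierNonAllocating {
    type Error = TooManyVariants;

    fn try_from(id: LanguageIdentifier) -> Result<Self, Self::Error> {
        if id.variants.len() > 1 {
            Err(TooManyVariants(id.variants.len()))
        } else {
            Ok(Self(id))
        }
    }
}

impl LanguageIdentifierNonAllocating {
    /// Read-only access, so the invariant cannot be broken after construction.
    pub fn get(&self) -> &LanguageIdentifier {
        &self.0
    }
}
```

Code that wants the guarantee would accept only the wrapper type, so the check happens once at the boundary.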


macchiati avatar macchiati commented on September 15, 2024

from icu4x.

markusicu avatar markusicu commented on September 15, 2024

FYI

BCP 47 spec for Variant Subtags: https://www.rfc-editor.org/rfc/rfc5646.html#section-2.2.5

Further down in https://www.rfc-editor.org/rfc/rfc5646.html#section-4.1 Choice of Language Tag, guideline 6 says "Use variant subtags sparingly and in the correct order." which is not necessarily alphabetic order, while LDML says to sort variants in order to create a canonical form. It has two examples of language tags that are "correctly ordered" but are in reverse ASCII order.

@macchiati is this difference documented in the LDML spec? If not, what's the best way to address it?

Note that the second example in section 4.1 has three variant subtags: "sl-IT-rozaj-biske-1994" This is meaningful and real: See the entry for "Subtag: 1994" in the https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

https://www.rfc-editor.org/rfc/rfc5646.html#section-4.4.1 Working with Limited Buffer Sizes looks relevant. It says "Protocols or specifications that specify limited buffer sizes for language tags MUST allow for language tags of at least 35 characters." which provides for two variant subtags.

https://www.rfc-editor.org/rfc/rfc5646.html#appendix-A Examples of Language Tags includes "sl-rozaj-biske (San Giorgio dialect of Resian dialect of Slovenian)"


markusicu avatar markusicu commented on September 15, 2024

http://www.unicode.org/reports/tr35/tr35.html#Canonical_Unicode_Locale_Identifiers "A unicode_locale_id has canonical syntax when: ... Any variants are in alphabetical order (eg, en-fonipa-scouse, not en-scouse-fonipa) ..."
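The ordering rule quoted above can be checked with a few lines. This is only a sketch: the function name `variants_are_canonical` is hypothetical, and it assumes that "alphabetical order" means byte-wise comparison of already-lowercased ASCII subtags:

```rust
/// Returns true when the variants are strictly increasing under byte-wise
/// comparison, which also rejects duplicates.
pub fn variants_are_canonical(variants: &[&str]) -> bool {
    variants.windows(2).all(|pair| pair[0] < pair[1])
}
```

Under this check, `["fonipa", "scouse"]` is canonical and `["scouse", "fonipa"]` is not. The BCP 47 example order `["rozaj", "biske", "1994"]` also fails it, which is the conflict between the two specs discussed in this thread.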


zbraniecki avatar zbraniecki commented on September 15, 2024

unic-langid sorts variants using alphabetical order


markusicu avatar markusicu commented on September 15, 2024

https://unicode-org.atlassian.net/browse/CLDR-13729 "Canonical Unicode Locale Identifiers variant order is in conflict with BCP 47 guideline"


sffc avatar sffc commented on September 15, 2024

This question is moot if we don't expose LanguageIdentifier as a public type. I will close this issue as obsolete, and we can follow up in #64.


macchiati avatar macchiati commented on September 15, 2024

from icu4x.
