Problems with multiple variant subtags (use cases?), about unicode-org/icu4x

zbraniecki commented on September 15, 2024

I'm not opposed to that, but I'd prefer to try other ways to mitigate the size increase first:

- a dummy sort, as suggested by Manish
- an optional Vec, or a smallvec

I see such an optimization as a last resort, as it moves away from a nice semantic API toward a compromise.
Is that OK with you?

from icu4x.

sffc commented on September 15, 2024

Even simpler would be to allow only a single variant subtag in LanguageIdentifier, and fail to parse when there is more than one. That would make the LanguageIdentifier data model dead simple. If users want more variant subtags, use the heavier Locale class.

I'm not opposed to looking at other solutions, but what are the use cases for multiple variant subtags that we care about supporting? Right now we are compromising code size and complexity for correctness. Who actually needs that correctness, needs to use LanguageIdentifier, and can't use Locale?

Using an optional vec doesn't solve the problem, because as long as a vec could be used, we still need to carry the extra code.
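
The single-variant model proposed in this comment could look roughly like the sketch below. It is only an illustration: `Subtag`, `SingleVariantLangId`, `ParseError` and `parse` are hypothetical names, not unic-langid or ICU4X APIs, and the sketch lowercases every subtag and accepts only `-` as a separator to stay short. The point is that every field has a fixed size, so the type is `Copy` and never touches the heap, and a second variant is a parse error.

```rust
/// Hypothetical fixed-size subtag: up to 8 ASCII bytes, zero-padded, so it is `Copy`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Subtag([u8; 8]);

impl Subtag {
    /// Stores the subtag lowercased; callers validate length and charset first.
    fn new(s: &str) -> Self {
        let mut buf = [0u8; 8];
        for (dst, b) in buf.iter_mut().zip(s.bytes()) {
            *dst = b.to_ascii_lowercase();
        }
        Subtag(buf)
    }
}

/// Hypothetical "plain old data" identifier with at most one variant.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct SingleVariantLangId {
    language: Subtag,
    script: Option<Subtag>,
    region: Option<Subtag>,
    variant: Option<Subtag>,
}

#[derive(Debug, PartialEq, Eq)]
enum ParseError {
    InvalidSubtag,
    MultipleVariants,
}

fn is_alpha(s: &str) -> bool {
    !s.is_empty() && s.bytes().all(|b| b.is_ascii_alphabetic())
}

/// Variant shape: 5-8 alphanumerics, or a digit followed by 3 alphanumerics.
fn is_variant(s: &str) -> bool {
    let alnum = !s.is_empty() && s.bytes().all(|b| b.is_ascii_alphanumeric());
    alnum && (matches!(s.len(), 5..=8) || (s.len() == 4 && s.as_bytes()[0].is_ascii_digit()))
}

fn parse(input: &str) -> Result<SingleVariantLangId, ParseError> {
    let mut parts = input.split('-').peekable();

    let lang = parts.next().unwrap_or("");
    if !(is_alpha(lang) && matches!(lang.len(), 2..=3 | 5..=8)) {
        return Err(ParseError::InvalidSubtag);
    }

    // Each optional subtag is consumed only if the next piece has the right shape.
    let script = parts.next_if(|s| s.len() == 4 && is_alpha(s)).map(Subtag::new);
    let region = parts
        .next_if(|s| {
            (s.len() == 2 && is_alpha(s))
                || (s.len() == 3 && s.bytes().all(|b| b.is_ascii_digit()))
        })
        .map(Subtag::new);
    let variant = parts.next_if(|s| is_variant(s)).map(Subtag::new);

    match parts.next() {
        None => Ok(SingleVariantLangId { language: Subtag::new(lang), script, region, variant }),
        // A second well-formed variant is rejected rather than stored.
        Some(extra) if is_variant(extra) => Err(ParseError::MultipleVariants),
        Some(_) => Err(ParseError::InvalidSubtag),
    }
}
```

With this sketch, `parse("sl-IT-rozaj")` succeeds, while `parse("sl-rozaj-biske")` returns `Err(ParseError::MultipleVariants)`; a caller needing both variants would have to fall back to a heavier type.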

kpozin commented on September 15, 2024

Is lossy optimization needed at this point? What benchmarks are we trying to beat?

sffc commented on September 15, 2024

> Is lossy optimization needed at this point?

What I'm trying to say is that "lossy optimization" is only lossy if there are use cases that break.

For example, instead of thinking about this as removing a feature, think about it in the other direction, as adding a feature: say we supported language-script-region with a single variant, and we wanted to add support for multiple variants. We would need to make the case that the use cases for that feature justify the increased code complexity.

> What benchmarks are we trying to beat?

That's a good question. I don't have a perfect answer, other than the stats I previously posted in zbraniecki/unic-locale#49 suggesting that multiple variants doubled code size for LanguageIdentifier.

I also see it as "common sense": I can't think of any language tags I've seen in ICU and elsewhere that have multiple variants. Why add a feature that no one is going to use, whether or not it increases code complexity?

zbraniecki commented on September 15, 2024

@sffc I think I see it as conformance to the spec. If we don't think variants should be a list, we should work with Unicode to update the spec to allow for LanguageIdentifier with just a single variant allowed to be a thing.

Implementing it here as a "Language Identifier, but different" would be a last resort, I think, if we run into actual trouble with code size.
My hope is that the Vec is not actually what causes the size increase, and that it's the sorting/dedup, which we can hopefully shrink without having to degrade our support for the standard.

Manishearth commented on September 15, 2024

I think we should first explore having our own sort function

sffc commented on September 15, 2024

> @sffc I think I see it as conformance to the spec. If we don't think variants should be a list, we should work with Unicode to update the spec to allow for LanguageIdentifier with just a single variant allowed to be a thing.

@macchiati Thoughts on this? Why does CLDR allow a variable number of variant subtags?

There's an elegance to having LanguageIdentifier be "plain old data", with no chance of it ever needing to heap-allocate. My instinct is that we shouldn't let an arcane part of the spec with no clear use cases complicate a low-level data structure like this. But, if we think spec compliance is more important, then so be it, and we can investigate other options.

> I think we should first explore having our own sort function

I'm more concerned about the fact that LanguageIdentifier can require a heap allocation than I am about which sorting algorithm we use.

macchiati commented on September 15, 2024

One possibility is to store all variants combined into a single string (which is usually empty). If there are any variants when parsing:

1. separate the input variants (checking for well-formedness)
2. put in alphabetical order (normalized for comparison)
3. join into a single string with "-"

The API can split the variants to hand back to the user as a set or list. Some extra performance cost using that API, but it is only rarely needed.

Mark
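
The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions: `Variants`, `from_subtags` and `iter` are hypothetical names, "normalized for comparison" is taken to mean ASCII lowercasing, and duplicate variants are neither rejected nor removed. An empty `String` does not allocate, so the common no-variant case stays off the heap.

```rust
/// Hypothetical storage: all variants in one canonical string, empty when there are none.
#[derive(Clone, Debug, Default, PartialEq, Eq)]
struct Variants(String);

/// Variant shape: 5-8 alphanumerics, or a digit followed by 3 alphanumerics.
fn is_variant(s: &str) -> bool {
    let alnum = !s.is_empty() && s.bytes().all(|b| b.is_ascii_alphanumeric());
    alnum && (matches!(s.len(), 5..=8) || (s.len() == 4 && s.as_bytes()[0].is_ascii_digit()))
}

impl Variants {
    /// Builds the canonical joined form, or `None` if any subtag is malformed.
    fn from_subtags(input: &[&str]) -> Option<Self> {
        // 1. Separate the input variants, checking each for well-formedness.
        if !input.iter().all(|s| is_variant(s)) {
            return None;
        }
        // 2. Normalize (lowercase, an assumption) and put in alphabetical order.
        let mut subtags: Vec<String> = input.iter().map(|s| s.to_ascii_lowercase()).collect();
        subtags.sort();
        // 3. Join into a single string with "-".
        Some(Variants(subtags.join("-")))
    }

    /// Splits the stored string back into individual variants on demand.
    fn iter(&self) -> impl Iterator<Item = &str> {
        self.0.split('-').filter(|s| !s.is_empty())
    }
}
```

The temporary `Vec` exists only while parsing a tag that actually has variants; the stored value is a single string, and splitting happens lazily when a caller asks for the list.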

zbraniecki commented on September 15, 2024

> There's an elegance to having LanguageIdentifier be "plain old data", with no chance of it ever needing to heap-allocate.

My experience working with unic-langid is that it works quite well: you don't heap-allocate unless you use variants. So for all the locales we deal with during Firefox startup, we never heap-allocate for any LanguageIdentifier, because none of them needs it.

If you want to make sure your code doesn't, you could maybe wrap LanguageIdentifier in a struct like LanguageIdentifierNonAllocating which wouldn't allow for multiple variants?

For code size, I'm with @Manishearth and would prefer to start with just implementing dumb-sort.
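
A "dumb sort" of the kind mentioned here could be as small as an insertion sort followed by `Vec::dedup`. The sketch below is an assumption about what such a function might look like, not the code that was eventually used; `dumb_sort` and `canonicalize_variants` are hypothetical names, and whether this actually reduces binary size compared with `slice::sort` would have to be measured.

```rust
/// Insertion sort: O(n^2), but tiny, and adequate for the handful of
/// variant subtags a language tag carries in practice.
fn dumb_sort<T: Ord>(items: &mut [T]) {
    for i in 1..items.len() {
        let mut j = i;
        // Shift the element at `i` left until its left neighbour is not larger.
        while j > 0 && items[j - 1] > items[j] {
            items.swap(j - 1, j);
            j -= 1;
        }
    }
}

/// Sorts the variants and removes adjacent duplicates.
fn canonicalize_variants(variants: &mut Vec<&str>) {
    dumb_sort(variants.as_mut_slice());
    variants.dedup();
}
```

Calling `canonicalize_variants` on `vec!["scouse", "fonipa", "scouse"]` leaves `["fonipa", "scouse"]`.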

macchiati commented on September 15, 2024

Sorry, I should have answered your question first, Shane. CLDR allows any number of variants because BCP47 does. BCP47 allows multiples because some of them are reasonable to combine productively.

Mark

markusicu commented on September 15, 2024

FYI

BCP 47 spec for Variant Subtags: https://www.rfc-editor.org/rfc/rfc5646.html#section-2.2.5

Further down, in https://www.rfc-editor.org/rfc/rfc5646.html#section-4.1 (Choice of Language Tag), guideline 6 says "Use variant subtags sparingly and in the correct order.", which is not necessarily alphabetical order, while LDML says to sort variants in order to create a canonical form. That section has two examples of language tags that are "correctly ordered" but are in reverse ASCII order.

@macchiati is this difference documented in the LDML spec? If not, what's the best way to address it?

Note that the second example in section 4.1 has three variant subtags: "sl-IT-rozaj-biske-1994". This is meaningful and real: see the entry for "Subtag: 1994" in the registry at https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

https://www.rfc-editor.org/rfc/rfc5646.html#section-4.4.1 Working with Limited Buffer Sizes looks relevant. It says "Protocols or specifications that specify limited buffer sizes for language tags MUST allow for language tags of at least 35 characters." which provides for two variant subtags.

https://www.rfc-editor.org/rfc/rfc5646.html#appendix-A Examples of Language Tags includes "sl-rozaj-biske (San Giorgio dialect of Resian dialect of Slovenian)"

markusicu commented on September 15, 2024

http://www.unicode.org/reports/tr35/tr35.html#Canonical_Unicode_Locale_Identifiers "A unicode_locale_id has canonical syntax when: ... Any variants are in alphabetical order (eg, en-fonipa-scouse, not en-scouse-fonipa) ..."
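
To make the difference between the two orderings concrete, the sketch below checks the LDML rule quoted above against the tags mentioned in this thread. It assumes that "alphabetical order" means byte-wise comparison of lowercase ASCII subtags; the function names are hypothetical, and duplicates are not rejected.

```rust
/// True when the variant subtags are in ascending byte-wise order.
/// Assumes lowercase ASCII input.
fn variants_in_alphabetical_order(variants: &[&str]) -> bool {
    variants.windows(2).all(|pair| pair[0] <= pair[1])
}

/// Returns the variants lowercased and sorted, as the LDML canonical form asks.
fn alphabetical(variants: &[&str]) -> Vec<String> {
    let mut sorted: Vec<String> = variants.iter().map(|v| v.to_ascii_lowercase()).collect();
    sorted.sort();
    sorted
}
```

Under these assumptions, `["fonipa", "scouse"]` passes the check, while the variants of the RFC 5646 example "sl-IT-rozaj-biske-1994", taken in their written order `["rozaj", "biske", "1994"]`, do not; `alphabetical` reorders them to `["1994", "biske", "rozaj"]`, since ASCII digits sort before letters.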

zbraniecki commented on September 15, 2024

unic-langid sorts variants using alphabetical order

markusicu commented on September 15, 2024

https://unicode-org.atlassian.net/browse/CLDR-13729 "Canonical Unicode Locale Identifiers variant order is in conflict with BCP 47 guideline"

sffc commented on September 15, 2024

This question is moot if we don't expose LanguageIdentifier as a public type. I will close this issue as obsolete, and we can follow up in #64.

macchiati commented on September 15, 2024

BTW, I disagree with many comments in https://unicode-org.atlassian.net/browse/CLDR-13729; added a note.

Mark
