nightowl888 / icu4n

International Components for Unicode for .NET

License: Apache License 2.0

C# 99.41% PowerShell 0.57% Shell 0.01% Batchfile 0.01%

breakiterator globalization hacktoberfest icu icu4j international normalization transliterator unicode

icu4n's Issues

Determine a solution for embedded data so it doesn't make the NuGet package grow exponentially per target framework

If the data were somehow embedded in a shared assembly, we wouldn't need to have duplicates of the data inside each framework targeted binary. Instead of:

ICU4N.dll (net45 > 12MB)
ICU4N.dll (netstandard1.3 > 12MB)
ICU4N.dll (netstandard2.0 > 12MB)

We could have something like

ICU4N.dll (net45 > 1MB)
ICU4N.dll (netstandard1.3 > 1MB)
ICU4N.dll (netstandard2.0 > 1MB)

Data.dll (netstandard1.0 > 11MB)

This scales much better for new target frameworks and saves a lot of disk space. However, the best means of accomplishing it is unclear, as is whether the DLL can or should be embedded into the NuGet package.

Note also that using the name ICU4N.Resources.dll for the assembly name is not possible because .Resources.dll is reserved for another purpose.

Change namespaces/project structure to match .NET

The port was originally done from ICU4J, whose namespaces were meant to match the JDK. ICU4C, by comparison, is structured completely differently.

Many (or most) of the classes defined in com.ibm.icu.text would likely be in System.Globalization if they were in .NET. For example, the Collator class is a close match for System.Globalization.CompareInfo, so it should be moved to a Globalization namespace as well.

CompareInfo is also a property of CultureInfo, so we should probably look at moving the collator functionality (or at least part of it) into the main library.

Other classes, such as BreakIterator also seem like good candidates for the Globalization namespace. More analysis is needed to determine if anything really belongs in the Text namespace.

Keeping in Sync with ICU

Ideally, we would keep the project structure exactly the same as ICU4J so it is easy to port the diff between tags over to ICU4N file by file. However, if we take the approach the ICU team is using, each file simply carries a header indicating which file(s) it is a port of, which makes syncing easier. More thought needs to be given to how best to do this, as ICU4N should not be a complete line-by-line port of ICU4J because of the gaps in functionality between Java and .NET.

UCultureInfo.CurrentCulture throws StackOverflowException for Chinese (region-specific)

Getting UCultureInfo.CurrentCulture will throw a StackOverflowException if the current culture is any of the following: zh-CN, zh-HK, zh-MO, zh-SG, zh-TW.

Steps to reproduce

// If you're not running Windows in Chinese
Thread.CurrentThread.CurrentCulture = CultureInfo.GetCultureInfo("zh-TW");
var culture = UCultureInfo.CurrentCulture;

DOCs: Convert remaining Javadocs to C# documentation comments

We need this for Intellisense support and possibly later for generating API documentation.

BreakIterator.GetCharacterInstance() - results differ from ICU4J

I'm getting different results for a number of tests (9 out of 919, so not too bad...). The test names relate to XPath 4.0 tests in

https://github.com/qt4cg/qt4tests/blob/master/fn/graphemes.xml

graphemes-1172
Input: "a🏿👶" (U+0061 U+1F3FF U+1F476)
Java result: 2 strings, "a🏿", "👶" (U+0061 U+1F3FF ; U+1F476)
C# result: 3 separate single-codepoint strings.

graphemes-1173
Input: "a🏿👶‍🛑" (U+0061 U+1F3FF U+1F476 U+200D U+1F6D1)
Java result: 2 strings, "a🏿", "👶‍🛑" (U+0061 U+1F3FF ; U+1F476 U+200D U+1F6D1)
C# result: 3 strings of lengths 1, 1, 3 respectively

Other failures, not specifically analysed (happy to send the results if needed):

graphemes-1180
graphemes-1181
graphemes-1182
graphemes-1183
graphemes-1184
graphemes-1185
graphemes-1189

It's possible of course that it's a Unicode version issue.

API: Convert flag constants to [Flags] enums, where appropriate

Much analysis must be done to ensure that the APIs contain all possible enums so all supported flags parameters can be passed, possibly by adding parameters for multiple enums.

This has already been done on the Normalizer class, but there are still other classes that use an int for flags which can accept a wide range of options. We don't necessarily need to make this any more than cosmetic - the values of the [Flags] enum values can remain the same as the constants that are being used now.
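As a rough illustration of the "cosmetic" conversion described above, here is a minimal sketch. The names `LegacyOptions`, `ParseOptions`, and `FlagsDemo` are hypothetical and do not correspond to actual ICU4N types; the point is only that the enum members keep the numeric values of the existing constants, so an enum-based overload can forward to the existing int-based logic with a cast.

```csharp
using System;

// Hypothetical int constants, standing in for flags ported from ICU4J.
public static class LegacyOptions
{
    public const int IgnoreCase = 1;
    public const int SkipWhiteSpace = 2;
    public const int AllowEmpty = 4;
}

// Hypothetical [Flags] enum. Each member keeps the value of the matching
// constant, so converting between the two is a plain cast.
[Flags]
public enum ParseOptions
{
    None = 0,
    IgnoreCase = 1,
    SkipWhiteSpace = 2,
    AllowEmpty = 4,
}

public static class FlagsDemo
{
    // Existing style: options passed as an int bit set.
    public static bool IsIgnoreCase(int options)
        => (options & LegacyOptions.IgnoreCase) != 0;

    // New style: the enum overload forwards to the same logic.
    public static bool IsIgnoreCase(ParseOptions options)
        => IsIgnoreCase((int)options);
}
```

A caller can then write `FlagsDemo.IsIgnoreCase(ParseOptions.IgnoreCase | ParseOptions.AllowEmpty)` instead of combining raw integers.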

.NET 7/.NET 8 MAUI projects don't build if NuGet package referencing ICU4N.Resources is added: alleged same target path for e.g. zh-HK\ICU4N.resources.dll

I have found that .NET 7 or .NET 8 MAUI projects stop building with VS 2022 with errors like shown below as soon as I add SaxonCS as a NuGet package to the project; SaxonCS (12.4) references ICU4N.Resources 60.1.0-alpha.402.

My translation of the (German) error message that VS gives repeatedly below is "The payload contains multiple files with the same target path".

Fehler	APPX1101	Die Nutzdaten enthalten mehrere Dateien mit dem gleichen Zielpfad "zh-HK\ICU4N.resources.dll". Quelldateien: 
C:\Users\marti\.nuget\packages\icu4n.resources\60.1.0-alpha.402\lib\netstandard1.0\zh-HK\ICU4N.resources.dll
C:\Users\marti\.nuget\packages\icu4n.resources\60.1.0-alpha.402\lib\netstandard1.0\zh-Hant-HK\ICU4N.resources.dll	MauiNet7SaxonCS12Test1	C:\Users\marti\.nuget\packages\microsoft.windowsappsdk\1.2.221209.1\buildTransitive\Microsoft.Build.Msix.Packaging.targets	1504		
Fehler	APPX1101	Die Nutzdaten enthalten mehrere Dateien mit dem gleichen Zielpfad "zh-SG\ICU4N.resources.dll". Quelldateien: 
C:\Users\marti\.nuget\packages\icu4n.resources\60.1.0-alpha.402\lib\netstandard1.0\zh-Hans-SG\ICU4N.resources.dll
C:\Users\marti\.nuget\packages\icu4n.resources\60.1.0-alpha.402\lib\netstandard1.0\zh-SG\ICU4N.resources.dll	MauiNet7SaxonCS12Test1	C:\Users\marti\.nuget\packages\microsoft.windowsappsdk\1.2.221209.1\buildTransitive\Microsoft.Build.Msix.Packaging.targets	1504		
Fehler	APPX1101	Die Nutzdaten enthalten mehrere Dateien mit dem gleichen Zielpfad "zh-MO\ICU4N.resources.dll". Quelldateien: 
C:\Users\marti\.nuget\packages\icu4n.resources\60.1.0-alpha.402\lib\netstandard1.0\zh-Hant-MO\ICU4N.resources.dll
C:\Users\marti\.nuget\packages\icu4n.resources\60.1.0-alpha.402\lib\netstandard1.0\zh-MO\ICU4N.resources.dll	MauiNet7SaxonCS12Test1	C:\Users\marti\.nuget\packages\microsoft.windowsappsdk\1.2.221209.1\buildTransitive\Microsoft.Build.Msix.Packaging.targets	1504		
Fehler	APPX1101	Die Nutzdaten enthalten mehrere Dateien mit dem gleichen Zielpfad "zh-TW\ICU4N.resources.dll". Quelldateien: 
C:\Users\marti\.nuget\packages\icu4n.resources\60.1.0-alpha.402\lib\netstandard1.0\zh-Hant-TW\ICU4N.resources.dll
C:\Users\marti\.nuget\packages\icu4n.resources\60.1.0-alpha.402\lib\netstandard1.0\zh-TW\ICU4N.resources.dll	MauiNet7SaxonCS12Test1	C:\Users\marti\.nuget\packages\microsoft.windowsappsdk\1.2.221209.1\buildTransitive\Microsoft.Build.Msix.Packaging.targets	1504

I am not sure what is causing this. Other .NET 8/VS 2022 project types, such as WPF, build and work fine with SaxonCS referencing ICU4N.Resources 60.1.0-alpha.402.

Any idea whether that is something you can fix with ICU4N.Resources or whether I need to try to file that as a bug on .NET 7/8 MAUI/VS 2022?

Add IsVisualBasicIdentifier() and IsVisualBasicIdentifierPart() methods to UChar

Similar methods were part of the ICU4J implementation. The rules for how to implement these methods for VB are documented here.

Task: Modify methods to use ref or out parameters instead of array parameters and return values, where sensible

Java does not support ref or out parameters, so as a workaround methods were designed to accept or return array types of specific lengths (each element representing a specific value). These methods need to be analyzed and changed to use ref or out parameters, where appropriate.

Example

        /// <summary>
        /// Parse a single non-whitespace character '<paramref name="ch"/>', optionally
        /// preceded by whitespace.
        /// </summary>
        /// <param name="id">The string to be parsed.</param>
        /// <param name="pos">INPUT-OUTPUT parameter.  On input, pos[0] is the
        /// offset of the first character to be parsed.  On output, pos[0]
        /// is the index after the last parsed character.  If the parse
        /// fails, pos[0] will be unchanged.</param>
        /// <param name="ch">The non-whitespace character to be parsed.</param>
        /// <returns>true if '<paramref name="ch"/>' is seen preceded by zero or more
        /// whitespace characters.</returns>
        public static bool ParseChar(string id, int[] pos, char ch)
        {
            int start = pos[0];
            pos[0] = PatternProps.SkipWhiteSpace(id, pos[0]);
            if (pos[0] == id.Length ||
                    id[pos[0]] != ch)
            {
                pos[0] = start;
                return false;
            }
            ++pos[0];
            return true;
        }

Can be changed to:

        /// <summary>
        /// Parse a single non-whitespace character '<paramref name="ch"/>', optionally
        /// preceded by whitespace.
        /// </summary>
        /// <param name="id">The string to be parsed.</param>
        /// <param name="pos">INPUT-OUTPUT parameter.  On input, pos is the
        /// offset of the first character to be parsed.  On output, pos
        /// is the index after the last parsed character.  If the parse
        /// fails, pos will be unchanged.</param>
        /// <param name="ch">The non-whitespace character to be parsed.</param>
        /// <returns>true if '<paramref name="ch"/>' is seen preceded by zero or more
        /// whitespace characters.</returns>
        public static bool ParseChar(string id, ref int pos, char ch)
        {
            int start = pos;
            pos = PatternProps.SkipWhiteSpace(id, pos);
            if (pos == id.Length ||
                    id[pos] != ch)
            {
                pos = start;
                return false;
            }
            ++pos;
            return true;
        }

Be sure to update the documentation appropriately to reflect the changes.

Of course, for this example to compile, all callers of ParseChar() as well as the PatternProps.SkipWhiteSpace() method will need to be modified, as well.
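To show what the change looks like from the caller's side, here is a self-contained sketch. The `SkipWhiteSpace` helper is a stand-in that uses `char.IsWhiteSpace` (an assumption; the real `PatternProps.SkipWhiteSpace` follows ICU's Pattern_White_Space rules), and `CountLeadingSemicolons` is a hypothetical caller invented for illustration.

```csharp
using System;

public static class RefParseDemo
{
    // Stand-in for PatternProps.SkipWhiteSpace. Assumption: plain
    // char.IsWhiteSpace is close enough for this illustration.
    private static int SkipWhiteSpace(string s, int i)
    {
        while (i < s.Length && char.IsWhiteSpace(s[i])) i++;
        return i;
    }

    // Same logic as the ref-based ParseChar shown above.
    public static bool ParseChar(string id, ref int pos, char ch)
    {
        int start = pos;
        pos = SkipWhiteSpace(id, pos);
        if (pos == id.Length || id[pos] != ch)
        {
            pos = start; // parse failed: restore the caller's position
            return false;
        }
        ++pos;
        return true;
    }

    // Hypothetical caller. With the array-based API this would have
    // needed `int[] pos = { 0 };` and `pos[0]` everywhere.
    public static int CountLeadingSemicolons(string id)
    {
        int pos = 0, count = 0;
        while (ParseChar(id, ref pos, ';')) count++;
        return count;
    }
}
```

The caller no longer allocates a one-element array just to receive an updated position, and the signature makes the in/out intent explicit.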

To find the offending methods, I have created code analyzers in the lucenenet-codeanalysis-dev Visual Studio extension. Just install the extension and filter for the following:

LuceneDev1003 - Finds all methods that accept an array parameter (except for char[])
LuceneDev1004 - Finds all methods that return an array (except for char[])

Problem with Collator.getInstance(), related to ICU4N.resources

I'm getting an exception with Collator.getInstance(). I'm running in Rider. My project has a nuget dependency on ICU4N 60.1.0-alpha.402. Note the reference to ICU4N.resources version=60.0.0.0.

I thought I previously had this working but I might have been mistaken, because the application has a fallback path that catches the exception.

(Note, it would be great to get an ICU4N version that isn't at alpha status, since this leads to build warnings).

Michael Kay

System.TypeInitializationException: The type initializer for 'ICU4N.Globalization.CultureInfoExtensions' threw an exception.
---> System.TypeInitializationException: The type initializer for 'DotNetLocaleHelper' threw an exception.
---> System.TypeInitializationException: The type initializer for 'ICU4N.Impl.ICUData' threw an exception.
---> System.IO.FileNotFoundException: Could not load file or assembly 'ICU4N.resources, Version=60.0.0.0, Culture=neutral, PublicKeyToken=efb17c8e4f0e291b'. The system cannot find the file specified.

File name: 'ICU4N.resources, Version=60.0.0.0, Culture=neutral, PublicKeyToken=efb17c8e4f0e291b'
at System.Reflection.RuntimeAssembly.InternalLoad(AssemblyName assemblyName, StackCrawlMark& stackMark, AssemblyLoadContext assemblyLoadContext, RuntimeAssembly requestingAssembly, Boolean throwOnFileNotFound)
at System.Reflection.RuntimeAssembly.InternalGetSatelliteAssembly(CultureInfo culture, Version version, Boolean throwOnFileNotFound)
at System.Reflection.RuntimeAssembly.GetSatelliteAssembly(CultureInfo culture, Version version)
at ICU4N.Impl.ICUData..cctor()
--- End of inner exception stack trace ---
at ICU4N.Impl.ICUData.GetLocaleIDFromResourceName(String resourceName)
at ICU4N.Impl.ICUData.GetStream(Assembly loader, String resourceName, Boolean required)
at ICU4N.Impl.ICUBinary.GetData(Assembly assembly, String resourceName, String itemPath, Boolean required)
at ICU4N.Impl.ICUBinary.GetData(Assembly assembly, String resourceName, String itemPath)
at ICU4N.Impl.ICUResourceBundleReader.<>c__DisplayClass35_0.b__0(ReaderCacheKey key)
at ICU4N.Impl.SoftCache`2.<>c__DisplayClass1_1.<GetOrCreate>b__1()
at System.Lazy`1.ViaFactory(LazyThreadSafetyMode mode)
at System.Lazy`1.ExecutionAndPublication(LazyHelper executionAndPublication, Boolean useDefaultConstructor)
at System.Lazy`1.CreateValue()
at ICU4N.Impl.SoftCache`2.GetOrCreate(TKey key, Func`2 valueFactory)
at ICU4N.Impl.ICUResourceBundleReader.GetReader(String baseName, String localeID, Assembly root)
at ICU4N.Impl.ICUResourceBundle.CreateBundle(String baseName, String localeID, Assembly root)
at ICU4N.Impl.ICUResourceBundle.<>c__DisplayClass64_0.b__0(String key)
at ICU4N.Impl.SoftCache`2.<>c__DisplayClass1_1.<GetOrCreate>b__1()
at System.Lazy`1.ViaFactory(LazyThreadSafetyMode mode)
at System.Lazy`1.ExecutionAndPublication(LazyHelper executionAndPublication, Boolean useDefaultConstructor)
at System.Lazy`1.CreateValue()
at ICU4N.Impl.SoftCache`2.GetOrCreate(TKey key, Func`2 valueFactory)
at ICU4N.Impl.ICUResourceBundle.InstantiateBundle(String baseName, String localeID, String defaultID, Assembly root, OpenType openType)
at ICU4N.Impl.ICUResourceBundle.GetBundleInstance(String baseName, String localeID, String defaultID, Assembly root, OpenType openType)
at ICU4N.Impl.ICUResourceBundle.GetBundleInstance(String baseName, String localeID, Assembly root, OpenType openType)
at ICU4N.Impl.ICUResourceBundle.GetBundleInstance(String baseName, String localeID, Assembly root, Boolean disableFallback)
at ICU4N.Util.UResourceBundle.<>c__DisplayClass22_0.b__0(String key)
at System.Collections.Concurrent.ConcurrentDictionary`2.GetOrAdd(TKey key, Func`2 valueFactory)
at ICU4N.Util.UResourceBundle.GetRootType(String baseName, Assembly root)
at ICU4N.Util.UResourceBundle.InstantiateBundle(String baseName, String localeName, Assembly root, Boolean disableFallback)
at ICU4N.Impl.ICUResourceBundle.CreateUCultureList(String baseName, Assembly root)
at ICU4N.Impl.ICUResourceBundle.AvailEntry.<>c__DisplayClass7_0.b__0()
at System.Threading.LazyInitializer.EnsureInitializedCore[T](T& target, Func`1 valueFactory)
at ICU4N.Impl.ICUResourceBundle.AvailEntry.GetUCultureList(UCultureTypes types)
at ICU4N.Impl.ICUResourceBundle.GetUCultures(String baseName, Assembly assembly, UCultureTypes types)
at ICU4N.Impl.ICUResourceBundle.GetUCultures(UCultureTypes types)
at ICU4N.Globalization.UCultureInfo.GetCultures(UCultureTypes types)
at ICU4N.Globalization.UCultureInfo.DotNetLocaleHelper.LoadNonGregorianDefaultCalendars()
at System.Threading.LazyInitializer.EnsureInitializedCore[T](T& target, Boolean& initialized, Object& syncLock, Func`1 valueFactory)
at ICU4N.Globalization.UCultureInfo.DotNetLocaleHelper.EnsureInitialized()
at ICU4N.Globalization.UCultureInfo.DotNetLocaleHelper..cctor()
--- End of inner exception stack trace ---
at ICU4N.Globalization.UCultureInfo.DotNetLocaleHelper.EnsureInitialized()
at ICU4N.Globalization.CultureInfoExtensions..cctor()
--- End of inner exception stack trace ---
at ICU4N.Globalization.CultureInfoExtensions.ToUCultureInfo(CultureInfo culture)
at ICU4N.Globalization.UCultureInfo.GetCurrentCulture()
at ICU4N.Globalization.UCultureInfo.get_CurrentCulture()
at ICU4N.Text.Collator.GetInstance()
at Saxon.Eej.expr.sort.UcaCollatorUsingIcu..ctor(String uri) in /Users/mike/GitHub/saxon13/build/cs/ee/Saxon/Eej/expr/sort/UcaCollatorUsingIcu.cs:line 22

API: De-nest publicly exposed nested enums, interfaces, and classes

In .NET, classes tend to be flattened into namespaces rather than nested like Russian dolls as they are in Java. We should try to avoid nesting publicly exposed types, where practical.

Task: Change suffix of files generated by T4 templates from XXXExtension.cs to XXX.generated.cs

Unfortunately, the <Class Name>Extension.cs convention we chose for file naming conflicts with some existing ICU class names and is very easily confused with the <Class Name>Extensions.cs common convention for extension methods.

The .generated.cs suffix is in use by several projects and removes ambiguity about how the code files are derived. It can also help prevent attempts to "fix" the code file instead of the T4 template, since changes to the code file will be overwritten the next time the template is run.

It would be best to wait until #40 is done before working on this task.

State of Port

What is the state of this ICU4J port and is a release scheduled?

Task: Update inline StringBuilder calls to use ValueStringBuilder, when supported

Several methods that are intended to be called by other methods (often in a tight loop) new up a StringBuilder at the beginning of the method and then return the string at the end. A StringBuilder instance is allocated on the heap, and so are the underlying array chunks that it manages, so doing this inline when processing strings causes excessive garbage collection. A better alternative is to use ValueStringBuilder with a stack-allocated initial buffer (usually 32 chars will suffice). ValueStringBuilder is a ref struct, so it lives on the stack. Switching to it means the entire operation occurs on the stack unless it runs out of buffer, in which case it automatically moves the chars over to a heap array rented from the array pool (so the array can be reused across method calls).

Example

        public static string Escape(string s)
        {
            StringBuilder buf = new StringBuilder();
            for (int i = 0; i < s.Length;)
            {
                int c = Character.CodePointAt(s, i);
                i += UTF16.GetCharCount(c);
                if (c >= ' ' && c <= 0x007F)
                {
                    if (c == '\\')
                    {
                        buf.Append("\\\\"); // That is, "\\"
                    }
                    else
                    {
                        buf.Append((char)c);
                    }
                }
                else
                {
                    bool four = c <= 0xFFFF;
                    buf.Append(four ? "\\u" : "\\U");
                    buf.Append(Hex(c, four ? 4 : 8));
                }
            }
            return buf.ToString();
        }

Can be changed to:

        public static string Escape(string s)
        {
#if FEATURE_SPAN
            ValueStringBuilder buf = new ValueStringBuilder(stackalloc char[CharStackBufferSize]);
#else
            StringBuilder buf = new StringBuilder(s.Length);
#endif
            for (int i = 0; i < s.Length;)
            {
                int c = Character.CodePointAt(s, i);
                i += UTF16.GetCharCount(c);
                if (c >= ' ' && c <= 0x007F)
                {
                    if (c == '\\')
                    {
                        buf.Append("\\\\"); // That is, "\\"
                    }
                    else
                    {
                        buf.Append((char)c);
                    }
                }
                else
                {
                    bool four = c <= 0xFFFF;
                    buf.Append(four ? "\\u" : "\\U");
                    buf.Append(Hex(c, four ? 4 : 8));
                }
            }
            return buf.ToString();
        }

Note that ValueStringBuilder is disposable, but calling ToString() disposes it automatically. However, it is better to use .AsSpan() to keep the buffer on the stack as long as possible if the method does not return it as a string immediately. If ToString() is not called, it should be disposed by either putting it in a using block or wrapping it in a try/finally block to explicitly call Dispose().

The above code still heap allocates a string at the end (which is less than ideal), but at least we save the allocation of the StringBuilder and its underlying memory, which if done project-wide will be very impactful for performance and won't impact the API at all.

Note that we still need to conditionally compile with FEATURE_SPAN and use StringBuilder because net40 doesn't support spans. However, in this case we can pass s.Length to the constructor to pre-allocate the correct amount of heap.

This task depends on #54 and may also require #55.

Rename classes/interfaces/enums to conform with .NET conventions

A few violations:

Class names with acronyms should use Pascal casing for the acronym (e.g. IcuService instead of ICUService).
Abbreviations (rather than acronyms) should be eliminated (e.g. Properties rather than Props).
Need clarification on why some types are prefixed with U. Ideally, these would not be prefixed this way as well.

Unable to run Transliterator with `DOTNET_SYSTEM_GLOBALIZATION_INVARIANT="1"`

System.InvalidOperationException: Failed to compare two elements in the array.
 ---> System.TypeInitializationException: The type initializer for 'ICU4N.Text.Transliterator' threw an exception.
 ---> System.TypeInitializationException: The type initializer for 'ICU4N.Globalization.UCultureInfo' threw an exception.
 ---> System.Globalization.CultureNotFoundException: Only the invariant culture is supported in globalization-invariant mode. See https://aka.ms/GlobalizationInvariantMode for more information. (Parameter 'name')
en is an invalid culture identifier.
   at System.Globalization.CultureInfo..ctor(String name, Boolean useUserOverride)
   at ICU4N.Globalization.UCultureInfo..cctor()
   --- End of inner exception stack trace ---
   at ICU4N.Globalization.UCultureInfo.get_CurrentCulture()
   at ICU4N.Impl.ICUResourceBundle.GetBundleInstance(String baseName, String localeID, Assembly root, OpenType openType)
   at ICU4N.Impl.ICUResourceBundle.GetBundleInstance(String baseName, String localeID, Assembly root, Boolean disableFallback)
   at ICU4N.Util.UResourceBundle.<>c__DisplayClass25_0.<GetRootType>b__0(String key)
   at System.Collections.Concurrent.ConcurrentDictionary`2.GetOrAdd(TKey key, Func`2 valueFactory)
   at ICU4N.Util.UResourceBundle.GetRootType(String baseName, Assembly root)
   at ICU4N.Util.UResourceBundle.InstantiateBundle(String baseName, String localeName, Assembly root, Boolean disableFallback)
   at ICU4N.Util.UResourceBundle.GetBundleInstance(String baseName, String localeName, Assembly root, Boolean disableFallback)
   at ICU4N.Util.UResourceBundle.GetBundleInstance(String baseName, String localeName, Assembly root)
   at ICU4N.Text.Transliterator..cctor()
   --- End of inner exception stack trace ---
   at ICU4N.Text.Transliterator.GetInstance(String id)

To my understanding, the transliterator should use its own culture data and run with Invariant mode?

Or am I wrong and this has to run with the culture data installed?

Failing Test: ICU4N.Dev.Test.Collate.CollationServiceTest::TestRegisterFactory()

ICU4N.Dev.Test.Collate.CollationServiceTest::TestRegisterFactory() is failing on Linux due to the Collator.GetDisplayName(UCultureInfo) method not returning the correct name for a custom UCultureInfo instance registered with Collator.GetDisplayName().

Impact

The test is known to fail on Ubuntu 18.04 and Ubuntu 20.04 and only on .NET 5 and higher. While this makes it highly likely the issue is related to the ICU integration in .NET, changing to NLS seems to have no effect on the problem.

More investigation is needed, but due to the fact this is only returning display text for a culture it doesn't seem to be a blocker for releases at this point.

ICU4N.Text.CollationElementIterator.Ignorable is misspelt "Ingorable"

The property that should be named ICU4N.Text.CollationElementIterator.Ignorable is misspelt Ingorable.

Determine a way to utilize .NET externally localized resources with ICU4N

ICU4N comes with many localized files already, however, they do not come in the format that .NET expects. Some analysis needs to be done to determine how to make this happen so we don't need to embed all resources inside of the main DLLs.

Task: Auto-generate T4 templates

We need to update our build to automatically generate the files that are template-based. Until now, we have been doing this manually.

There is an example of this here: https://github.com/libgit2/libgit2sharp/blob/0e7ec84e1a339e0215b71f85244dd06a06d82d61/LibGit2Sharp/LibGit2Sharp.csproj#L27-L28.

Complete UChar implementation

In Java, the UCharacter class has all of the same APIs as the java.lang.Character class. Therefore, in .NET, we need UChar to cover the same APIs (static methods) as System.Char.

We need more clarity on specifically how to design the culture-sensitivity part of this, since in Java these methods were designed to use the equivalent of CultureInfo.InvariantCulture and therefore are not aware of the ambient culture.

We also need tests to validate expected behaviors with the new methods.

Convert CharacterIterator into ICharacterEnumerator and move to J2N

CharacterIterator and classes that depend on it are the only classes in ICU4N.Support that are still marked as public. Ideally, when ICU4N is released, there will be no public facing ICU4N.Support namespace.

The CharacterIterator class should be converted into an ICharacterEnumerator interface and moved to J2N, and all implementations converted also.

This has been attempted and is partially completed, but due to the fact that CharacterIterator uses post-increment behavior, it doesn't work very well. It was fairly easy to get J2N.Text.StringCharacterEnumerator working with ICU4N, but not so for Lucene.NET.

The approach taken was to make a wrapper class so the ICharacterEnumerator could be passed in, and behind the scenes it would be wrapped by a class that implements CharacterIterator (which was made internal). This didn't work in the opposite direction when converting the rest of the CharacterIterator classes into ICharacterEnumerator instances that can be passed to implementations of the BreakIterator abstract class. I am sure it is possible, but more effort is required to work out how to make it behave correctly (being that iterators return a value, and enumerators return true/false and then a property must be read but in the case of CharacterIterator, the property must somehow be read before the call to MoveNext() or MovePrevious()).
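The mismatch between the two models can be sketched as follows. All types here are hypothetical, forward-only simplifications and are not the real J2N or ICU4N APIs: the enumerator starts before the first character and reports success through `MoveNext()`, while the iterator-style adapter starts on the first character and returns the character (or a `Done` sentinel) directly. Priming the enumerator in the constructor is one possible way to bridge the two; it does not address `MovePrevious()` or random positioning.

```csharp
using System;

// Hypothetical enumerator shape: positioned before the first char.
public interface ICharacterEnumerator
{
    char Current { get; }
    bool MoveNext();
}

public sealed class StringCharacterEnumerator : ICharacterEnumerator
{
    private readonly string text;
    private int index = -1;

    public StringCharacterEnumerator(string text) { this.text = text; }

    public char Current => text[index];

    public bool MoveNext()
    {
        if (index + 1 >= text.Length) return false;
        index++;
        return true;
    }
}

// Iterator-style facade: Current is valid immediately, and Next()
// advances and returns the new current char, or Done at the end.
public sealed class IteratorAdapter
{
    public const char Done = '\uFFFF';

    private readonly ICharacterEnumerator inner;
    private bool valid;

    public IteratorAdapter(ICharacterEnumerator inner)
    {
        this.inner = inner;
        valid = inner.MoveNext(); // prime so Current can be read first
    }

    public char Current => valid ? inner.Current : Done;

    public char Next()
    {
        if (valid) valid = inner.MoveNext();
        return Current;
    }
}
```

For the text "ab", the adapter reports `Current` as 'a' before any call, then `Next()` yields 'b' and then `Done`.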

Complete UCultureInfo Implementation

In #3, a UCultureInfo type was created to replace the ICU4J ULocale type. The main goals for doing this are:

To provide a similar API as the CultureInfo class
To provide a way to set the culture/default culture of the current thread
To fill gaps in behavior between ICU and the .NET platform

We managed to get a prototype in place, but there are still some remaining tasks and research to complete.

Known Issues

TwoLetterISOLanguageName/GetTwoLetterISOLanguageName returns null for invariant culture, which is different than the documented behavior of ICU4J (at least for the 3 letter language), which is to return empty string. In .NET, the behavior is to return the string "iv".
The Parent property behaves differently than on the .NET platform. In ICU, the script tag is not considered as part of the fallback behavior, but in .NET it is (i.e. uz_Cyrl_UZ falls back to uz_Cyrl in .NET, but in ICU it falls back to uz). The AcceptLanguage overloads of UCultureInfo depend on the current behavior for the tests to pass. For now, Parent has been marked internal until this can be addressed.
The default behavior of CurrentCulture and CurrentUICulture when not explicitly set is to track the properties of the CultureInfo.CurrentCulture and CultureInfo.CurrentUICulture. So, when either of the latter changes, the former are automatically updated to the nearest corresponding culture. If set explicitly, this tracking stops and the culture that is set is used instead. However, once set there is currently no way to "unset" UCultureInfo.CurrentCulture or UCultureInfo.CurrentUICulture to get back to the original tracking behavior.
The ULocale class in ICU4J was immutable and its Clone method simply returned itself. However, in .NET, CultureInfo was designed to be mutable unless it is wrapped using the ReadOnly method. For now, UCultureInfo is immutable and marked sealed so the behavior cannot be changed.
Since LCID is not available in the CLDR data and is deprecated, the property has been commented out and none of the method overloads or constructors that utilize it have been added.
The Name property effectively returns the base name in ICU4J. They are similar, however, in .NET they are typically delimited by hyphen and in ICU they are delimited by underscore. Through some limited testing, it appears both UCultureInfo and CultureInfo accept either format. More research is required to determine whether changing to the .NET convention makes sense.
Lack of caching. In ICU4J, there are static properties for the most commonly used cultures (to match the JDK). In .NET, there are no such properties, but there are methods available to provide read-only cached cultures.
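The tracking behavior of CurrentCulture described in the list above, and one possible shape for the missing "unset" operation, could be sketched like this. Everything here is hypothetical: a plain culture name string stands in for UCultureInfo, the hyphen-to-underscore replacement is a crude stand-in for the real nearest-culture mapping, and `ResetToTracking` is an invented method name, not an existing or planned ICU4N API.

```csharp
using System;
using System.Globalization;

public static class CurrentCultureTracker
{
    // null means "not explicitly set": keep tracking CultureInfo.CurrentCulture.
    [ThreadStatic]
    private static string explicitName;

    public static string CurrentName
    {
        // While tracking, derive the value from the ambient .NET culture.
        get => explicitName ?? CultureInfo.CurrentCulture.Name.Replace('-', '_');
        // Setting a value stops tracking for this thread.
        set => explicitName = value ?? throw new ArgumentNullException(nameof(value));
    }

    // Hypothetical "unset" operation that resumes tracking.
    public static void ResetToTracking() => explicitName = null;
}
```

Keeping "not set" as a distinct state (here, null) is what makes a reset possible; whether that belongs on the public API is a separate design question.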

Missing Members of `CultureInfo`

The following members of CultureInfo are not yet present in UCultureInfo.

Properties

Methods

public static UCultureInfo CreateSpecificCulture(string name)
public UCultureInfo GetConsoleFallbackUICulture()
public static UCultureInfo GetCultureInfo(string name)
public static UCultureInfo GetCultureInfo(int culture)
public static UCultureInfo GetCultureInfo(string name, string altName) - This one seems to be a duplicate of passing the keyword (i.e. @collation=phonebook), but in ICU4N, there is currently no culture cache.
public virtual object GetFormat(Type formatType) - Added in #46.
public static UCultureInfo ReadOnly(UCultureInfo ci) - Added in #46.

Missing Members of `ULocale`

Properties

CharacterOrientation - Marked internal, since the logical place to put it would be on TextInfo
LineOrientation - Marked internal, since the logical place to put it would be on TextInfo
IsRightToLeft - Marked internal, since the logical place to put it would be on TextInfo

API Documentation

Some of the JavaDocs have yet to be converted and other members (including the main class header) have not yet been documented.

`UCultureInfoBuilder`

ICU4J's ULocale class has a nested Builder class that has been de-nested and marked internal. Its purpose is to safely build a locale object while validating the inputs, as the UCultureInfo class provides no such validation. The CultureInfo class also doesn't provide validation upon creation, but will throw an exception if the requested culture doesn't exist on the platform.

More research is needed to determine if a similar function exists on the .NET platform so we can correctly map this functionality (which does exist on the Java platform) or whether it makes sense to keep it as is and add it to the public API.

Poor error message when version is incorrect

Calling VersionInfo.getInstance("unknown") produces the error message "Invalid version number: Version number may be negative or greater than 255". I suspect "may not" was intended (or perhaps "may" should be "might"?), but that doesn't exactly capture the error either...

Create Tests for TryGet versions of methods that were added to UScript

TryGet versions of Get methods were added to UScript in order to avoid using exceptions for control flow. However, we need tests for those methods to confirm functionality behaves as expected.

API: Rename numeric methods/properties/fields to conform with .NET conventions

For example, GetLong() should be converted to GetInt64(), GetFloat() to GetSingle(), etc.

Task: Add ReadOnlySpan<char> as a char sequence type to the T4 templates

We will need the ReadOnlySpan<char> overloads to optimize utility methods and avoid Substring() calls, which are extremely slow in .NET compared to Java. This requires updating the CodeGenerationSettings.xml file to include them and editing the T4 templates to correctly generate the overloads.

Methods that contain parameters for startIndex, endIndex, limit, length should omit these parameters and rely on the Slice() method of ReadOnlySpan<char> instead.
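
As a rough illustration of the pattern described above, the sketch below pairs an index-based string method with a ReadOnlySpan<char> overload that takes no index parameters. The class and method names (SpanOverloadSketch, CountDigits) are hypothetical and are not part of ICU4N or its T4 templates; only string.AsSpan() and ReadOnlySpan<char>.Slice() are real .NET APIs.

```csharp
using System;

public static class SpanOverloadSketch
{
    // Existing style: the caller passes a string plus a start index and length.
    public static int CountDigits(string text, int startIndex, int length)
    {
        if (text is null) throw new ArgumentNullException(nameof(text));
        // AsSpan validates the range and creates a view without copying,
        // whereas Substring() would allocate a new string.
        return CountDigits(text.AsSpan(startIndex, length));
    }

    // Span style: no startIndex/length parameters; the caller slices first.
    public static int CountDigits(ReadOnlySpan<char> text)
    {
        int count = 0;
        foreach (char c in text)
        {
            if (char.IsDigit(c)) count++;
        }
        return count;
    }
}
```

A caller that previously passed (text, 6, 3) could instead write SpanOverloadSketch.CountDigits(text.AsSpan().Slice(6, 3)), which keeps the range logic at the call site.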

It would be best to wait until #40 is done before working on this task.

Finish MessageFormat implementation

The MessageFormat class is only partially implemented. It was actually only ported for the use of ChoiceFormat, which is required by Transliterator to load resources. But it currently doesn't work much beyond that purpose.

MessageFormat has many dependencies, including DateFormat and RuleBasedNumberFormat that would also need to be ported in order to make it complete.

Ideally, these format classes should implement ICustomFormatter and IFormatProvider to be compatible with string.Format() and other .NET APIs. Also, they should ideally not utilize CultureInfo as a property themselves, but be passed a CultureInfo instance when they do their work. That might not be feasible in all cases (such as rule-based number format). More analysis is required to work out the best approach for .NET compatibility.
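
For reference, this is roughly what the ICustomFormatter/IFormatProvider pairing looks like when consumed by string.Format(). The UpperCaseFormatter class and its "U" format specifier are invented for illustration and have nothing to do with the ICU format classes; the sketch only shows the plumbing a ported formatter would need.

```csharp
using System;
using System.Globalization;

// Hypothetical formatter (not an ICU4N type): handles the made-up "U" format
// specifier for strings and defers to default formatting for everything else.
public sealed class UpperCaseFormatter : IFormatProvider, ICustomFormatter
{
    // string.Format() asks the provider for an ICustomFormatter first.
    public object GetFormat(Type formatType)
    {
        return formatType == typeof(ICustomFormatter) ? this : null;
    }

    public string Format(string format, object arg, IFormatProvider formatProvider)
    {
        if (arg is string s && format == "U")
            return s.ToUpperInvariant();

        // Fall back to the argument's own formatting logic.
        if (arg is IFormattable formattable)
            return formattable.ToString(format, CultureInfo.InvariantCulture);

        return arg?.ToString() ?? string.Empty;
    }
}
```

Usage would look like string.Format(new UpperCaseFormatter(), "{0:U} {1}", "abc", 5). A real implementation would presumably resolve the culture from the provider it is handed rather than hard-coding the invariant culture as this sketch does.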

Build: Add automation for building custom resource distributions

We have basic support for being able to exclude resource distributions, but it is mostly manual. We need some more automation for this to go smoothly.

Make the ICU4JResourceConverter tool into a dotnet tool. This is blocked by the fact we depend on the .NET Framework version of al.exe that isn't cross OS.
Automate the process of unpacking the icu4j.jar file from Maven, giving the user a choice between packing satellite assemblies or resource files.
Add better tooling for being able to deploy a class library that depends on a custom resource package.
a. We need to ensure the version of the custom NuGet package stays aligned with the original ICU4N build.
b. The end consumer should be able to override the custom resource NuGet package with either the default or another custom resource package. This can be done by packing a buildTransitive/<assemblyName>.targets file that has a PackageReference to the custom NuGet package.

Copying the ICU4N.targets file worked pretty well. We just need to automate the process of packing a copy of this file as buildTransitive/<assemblyName>.targets in the appropriate scenarios (giving the user the ability to opt in or out of this) when a custom distribution is involved and to reference the custom package name.

Add IsCSharpIdentifier() and IsCSharpIdentifierPart() methods to UChar

Similar methods were part of the ICU4J implementation. The rules for how to implement these methods for C# are documented here.
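
A minimal sketch of such checks follows, assuming the rules in question are the identifier grammar from the C# language specification (letter categories Lu, Ll, Lt, Lm, Lo, Nl plus underscore for the first character; additionally Nd, Pc, Mn, Mc, Cf for later characters). The class and method names are placeholders, and the sketch works on single UTF-16 chars only, so supplementary code points would need separate handling in UChar.

```csharp
using System.Globalization;

public static class CSharpIdentifierSketch
{
    // True if c may begin a C# identifier: underscore or a letter-like category.
    public static bool IsIdentifierStart(char c)
    {
        if (c == '_') return true;
        switch (char.GetUnicodeCategory(c))
        {
            case UnicodeCategory.UppercaseLetter:  // Lu
            case UnicodeCategory.LowercaseLetter:  // Ll
            case UnicodeCategory.TitlecaseLetter:  // Lt
            case UnicodeCategory.ModifierLetter:   // Lm
            case UnicodeCategory.OtherLetter:      // Lo
            case UnicodeCategory.LetterNumber:     // Nl
                return true;
            default:
                return false;
        }
    }

    // True if c may appear after the first character of a C# identifier.
    public static bool IsIdentifierPart(char c)
    {
        if (IsIdentifierStart(c)) return true;
        switch (char.GetUnicodeCategory(c))
        {
            case UnicodeCategory.DecimalDigitNumber:    // Nd
            case UnicodeCategory.ConnectorPunctuation:  // Pc
            case UnicodeCategory.NonSpacingMark:        // Mn
            case UnicodeCategory.SpacingCombiningMark:  // Mc
            case UnicodeCategory.Format:                // Cf
                return true;
            default:
                return false;
        }
    }
}
```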

Task: Add ValueStringBuilder as an appendable type to the T4 templates

We will need the ValueStringBuilder overloads to optimize utility methods and avoid heap allocations in the main business logic. This requires updating the CodeGenerationSettings.xml file to include them and editing the T4 templates to correctly generate the overloads.

Note that ValueStringBuilder is a ref struct, so when passed into methods as a parameter, it must be passed as a ref argument. It is also declared internal, so all methods that are generated with it must also be internal.

Example

        public override IAppendable Normalize(string src, IAppendable dest)
        {
            try
            {
                return dest.Append(src);
            }
            catch (IOException e)
            {
                throw new ICUUncheckedIOException(e);  // Avoid declaring "throws IOException".
            }
        }

For the above method, an overload can be generated as:

        internal override void Normalize(string src, ref ValueStringBuilder dest)
        {
            try
            {
                dest.Append(src);
            }
            catch (IOException e)
            {
                throw new ICUUncheckedIOException(e);  // Avoid declaring "throws IOException".
            }
        }

It would be best to wait until #40 is done before working on this task.

Remove ByteBuffer from public API

A JDK-style ByteBuffer class was ported from Apache Harmony to make porting the code from Java quicker. However, this class ended up in several public APIs.

Eventually, we should factor out ByteBuffer and create a set of extension methods on byte[] and other numeric types to make the same low-level conversions that ByteBuffer does. ByteBuffer should not be exposed on the public API; we should instead use byte[] when passing the data around.

Task: Add cross-OS command-line build script

Currently all building and testing is done on Azure DevOps. However, building and testing on the command line is undocumented.

While it is possible to build using dotnet build, dotnet pack, dotnet test and other commands, it would be simpler for potential contributors if there were a wrapper script to launch these commands.

The technology used for the build script is not that important as long as it runs cross-OS, but a way to get this done without adding any additional technology would simply be to make the script in MSBuild.

The build-pack-and-publish-libraries.yml file can be used as a template for the tasks in the build script.

However it is done, the README.md page should be updated with documentation on how to build/test from the command line.

Implement subclass of CultureInfo (UCultureInfo) to replace ULocale

This issue is blocking ICU4N from going to beta.

In Java, the ULocale class was created as a separate entity in ICU4J because:

The Java Locale class is sealed/final
In Java, the Locale is stored in a static field, not associated with the current thread like CultureInfo is in .NET

To make ICU4N compatible with .NET, a subclass of CultureInfo should be created (named UCultureInfo, for consistency) to replace ULocale.

This subclass should have all of the functionality of ULocale, but subclassing CultureInfo makes the class automatically compatible with the properties that allow it to be set to the current thread.

Compatibility

Gaps

There are several gaps in functionality between Java and .NET that need to be addressed, including:

Conversion nn-NO-NY to/from nn-NO
NumberFormat differences (for example, in Java there is min and max number of decimals, but in .NET there is a single property indicating the number of decimals).

There may be others; more analysis is needed to get the complete list. Perhaps a conversation with the ICU team and/or Microsoft is required to work out how to account for these gaps.

Constructor

The constructor of UCultureInfo needs to accept standard .NET cultures (in which case we will just wrap the default culture) as well as ICU-style parameters. A decision needs to be made whether to keep the ICU4N underscores (en_US@collation=phonebook) or to make it more like the .NET style (en-US@collation=phonebook). The JDK uses no underscores because the values are passed as separate parameters to the constructor of Locale, so there is no definite precedent to follow.

We may need to take the ICU data into consideration when making this decision.

Calendars

.NET actually has more calendars than ICU, so it is unclear exactly how to deal with this just yet. ICU is meant to be an "up to date" version of Unicode that is more recent than .NET, but we need to check whether the calendars in .NET are being kept up with the standard.

Conversion

Unlike Java, we won't need to convert UCultureInfo > CultureInfo as we are a subclass so UCultureInfo is a CultureInfo already.

To convert the other way, my thought is that we should use the same pattern as was done on the System.Type class in .NET.

var name = this.GetType().GetTypeInfo().Name;

So, we could do a similar thing with CultureInfo by creating an extension method:

var ucultureInfo = CultureInfo.CurrentCulture.GetUCultureInfo();

This would allow us to add additional properties to the subclass for use by ICU4N and/or its users.
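
One way the proposed extension method could behave is sketched below. The UCultureInfo class here is only a stand-in that subclasses CultureInfo with no ICU data, and constructing a new instance from culture.Name is an assumption about how the wrapping would work, not a description of the actual implementation.

```csharp
using System;
using System.Globalization;

// Stand-in for illustration only: the real class would carry ICU locale data.
public class UCultureInfo : CultureInfo
{
    public UCultureInfo(string name) : base(name) { }
}

public static class CultureInfoExtensions
{
    // Returns the culture itself when it is already a UCultureInfo;
    // otherwise wraps its name in a new UCultureInfo (assumed behavior).
    public static UCultureInfo GetUCultureInfo(this CultureInfo culture)
    {
        if (culture is null) throw new ArgumentNullException(nameof(culture));
        if (culture is UCultureInfo existing) return existing;
        return new UCultureInfo(culture.Name);
    }
}
```

Because the subclass is a CultureInfo, the result can be assigned straight back to CultureInfo.CurrentCulture, and calling GetUCultureInfo() on it again returns the same instance rather than a copy.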

Miscellaneous

There are probably additional things that need to be considered. More analysis is required.

API: [Obsolete] members

Much of the public API is marked [Obsolete]. It is unclear why these features are not removed upon some major release, but kept in and maintained indefinitely by the ICU team. We should probably contact them to get some clarity on why this was done, and what their decision would be on accessibility of those classes/members if they were building ICU today.

Most likely, the right choice for us is to make these members internal.

Create extension methods for common BreakIterator operations

While BreakIterator provides great low-level functionality for iterating forward and backward through breaks, it would be great if there were a simple way to do forward-only operations on string, StringBuilder, and char[].

IEnumerable<int> wordBreaks = theString.ToWordBreaks();
foreach (var wordBreak in wordBreaks)
{
    // consume
}

IEnumerable<int> sentenceBreaks = theString.ToSentenceBreaks(new CultureInfo("th"));
foreach (var sentenceBreak in sentenceBreaks)
{
    // consume
}

We would ideally create a different extension method (with overloads for optional culture) for all 4 modes:

Word
Sentence
Line
Character

We could then expand on this to do a higher level operation, such as providing an IEnumerable<string> that would tokenize the text so it can be iterated with a foreach loop.

foreach (var word in theText.ToWords(new CultureInfo("th-th")))
{
   // consume each word
}

Some thought needs to be given to thread safety, since BreakIterator requires a separate clone for each thread.
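
One possible answer to the thread-safety question is to create a fresh iterator for every enumeration, so no iterator state is ever shared. The sketch below shows that shape without depending on ICU4N: IBoundarySource is a hypothetical stand-in for the small part of BreakIterator that is needed, and SpaceBoundarySource is a toy implementation that breaks after spaces, included only so the example runs.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the part of BreakIterator this sketch relies on.
public interface IBoundarySource
{
    // Returns the next boundary offset, or BreakEnumerationSketch.Done.
    int Next();
}

public static class BreakEnumerationSketch
{
    public const int Done = -1;

    // createSource runs once per enumeration, so each foreach loop (and each
    // thread) gets its own iterator instead of sharing a cached one.
    public static IEnumerable<int> ToBreaks(this string text, Func<string, IBoundarySource> createSource)
    {
        if (text is null) throw new ArgumentNullException(nameof(text));
        if (createSource is null) throw new ArgumentNullException(nameof(createSource));
        return Iterate(text, createSource);
    }

    private static IEnumerable<int> Iterate(string text, Func<string, IBoundarySource> createSource)
    {
        IBoundarySource source = createSource(text);
        for (int boundary = source.Next(); boundary != Done; boundary = source.Next())
        {
            yield return boundary;
        }
    }
}

// Toy source: reports a boundary after each space and at the end of the text.
public sealed class SpaceBoundarySource : IBoundarySource
{
    private readonly string text;
    private int position;

    public SpaceBoundarySource(string text) { this.text = text; }

    public int Next()
    {
        if (position >= text.Length) return BreakEnumerationSketch.Done;
        int space = text.IndexOf(' ', position);
        position = space < 0 ? text.Length : space + 1;
        return position;
    }
}
```

In a real implementation the factory would presumably clone or create a BreakIterator for the requested mode and culture; the trade-off is an allocation per enumeration in exchange for not needing locks.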

Spellout numbering

As far as I can tell, ICU4N doesn't include the spellout numbering capabilities of ICU4J.

I'm interested in assessing whether it's feasible to port this code and contribute it to the project. Having no familiarity with ICU4J internals, I wouldn't know where to start, but if you can provide any initial thoughts (perhaps you've looked at it and decided it's too hard...) then I'd appreciate any pointers.

Alternatively, rather than doing it ourselves we could sponsor the development.

Note, we are currently using ICU4N in the SaxonCS project for localised collation support.

Docs: Add documentation for disabling and making custom `ICU4N.Resources` distributions

#38 includes an ICU4N.targets file that contains configuration settings that allow for custom distributions of resource data. These settings will need documentation once all of the configuration settings have been made user-configurable.

Determine and implement a solution for loading custom data

A core feature of the ICU project is to allow the end user to override the data in many ways (3 if I recall correctly). We need to research the best way to allow users to inject the data.

Unfortunately, a lot of time has passed since this issue was first noted until it was documented, so I don't recall all of the specific details. But the documentation is here.

This issue is blocking us from going to beta.

Docs: Automate documentation generation

ICU has pretty good documentation already. However, many of the APIs were converted to be more .NET-like so we should probably at least have API docs.

We need to discuss with the ICU team whether it is within the realm of possibility for us to contribute ICU4N back to them to maintain. That would definitely factor into how this is done.

Convert Public Java Iterator classes to .NET Enumerators or Mark Internal

With the possible exception of BreakIterator and its subclasses, all public iterators should be converted to enumerators or (if not critical end user functionality) marked internal so they can be dealt with later.

Most of this work has already been completed, with the exception of iterators in the ICU4N.Collation assembly.

Finish UnicodeSet implementation

The following ISet<T> members are not yet implemented in UnicodeSet.

IsProperSubsetOf
IsProperSupersetOf
IsSubsetOf
SymmetricExceptWith

AFAIK, this is the only work left to do on UnicodeSet features, but we should review the class to confirm that.

Verify ConcurrentDictionary use, taking into account GetOrAdd can call creation callback more than once

See apache/lucenenet#417 for a more complete description.

This issue also applies to the GetOrCreate() method of SoftCache, which should be reviewed.

It is suspected that this may be the source of concurrency issues with ThaiTokenizer and ICUTokenizer tests of Lucene.NET, which are known to fail without extra locking that was not part of the original design.
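
A common mitigation for the GetOrAdd behavior is to store Lazy<T> wrappers, so that even if several wrappers are constructed under contention, only the one that wins the race ever runs the expensive factory. The LazyCache class below is a hypothetical sketch of that pattern and is not the existing SoftCache implementation.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public sealed class LazyCache<TKey, TValue>
{
    private readonly ConcurrentDictionary<TKey, Lazy<TValue>> cache =
        new ConcurrentDictionary<TKey, Lazy<TValue>>();

    public TValue GetOrCreate(TKey key, Func<TKey, TValue> factory)
    {
        if (factory is null) throw new ArgumentNullException(nameof(factory));

        // GetOrAdd may invoke this lambda more than once under contention,
        // but creating a Lazy<T> is cheap and only one wrapper is stored.
        Lazy<TValue> lazy = cache.GetOrAdd(
            key,
            k => new Lazy<TValue>(() => factory(k), LazyThreadSafetyMode.ExecutionAndPublication));

        // Only the stored wrapper's Value is read, so the factory runs once
        // per key even when many threads request it at the same time.
        return lazy.Value;
    }
}
```

One caveat with ExecutionAndPublication is that an exception thrown by the factory is cached and rethrown on later reads, which may or may not be the desired behavior for resource loading.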

API: Rename public enum values, constants, and static fields to conform to .NET conventions

For example:

public const string SOME_VALUE = "theValue";

Should be converted to Pascal Case:

public const string SomeValue = "theValue";

Add target for .NET 6/8 or .NET Standard 2.0/2.1

I'm using SaxonCS 12.4 which references ICU4N and ICU4N.Resources. The ICU4N.Resources NuGet package is targeting .NET Standard 1.6 which brings in a lot of out of date NuGet packages that contain security vulnerabilities and need updating. Any plans to update the NuGet package to add targets for .NET 6/8 or update the existing target to .NET Standard 2.0/2.1 so we can get rid of these old package references?
