nightowl888 / icu4n

International Components for Unicode for .NET

License: Apache License 2.0

C# 99.41% PowerShell 0.57% Shell 0.01% Batchfile 0.01%
breakiterator globalization hacktoberfest icu icu4j international normalization transliterator unicode

icu4n's People

Contributors

bongohrtech, introfog, nightowl888

icu4n's Issues

Determine a solution for embedded data so it doesn't make the NuGet package grow exponentially per target framework

If the data were somehow embedded in a shared assembly, we wouldn't need to have duplicates of the data inside each framework targeted binary. Instead of:

ICU4N.dll (net45 > 12MB)
ICU4N.dll (netstandard1.3 > 12MB)
ICU4N.dll (netstandard2.0 > 12MB)

We could have something like

ICU4N.dll (net45 > 1MB)
ICU4N.dll (netstandard1.3 > 1MB)
ICU4N.dll (netstandard2.0 > 1MB)

Data.dll (netstandard1.0 > 11MB)

This scales much better for new target frameworks and saves tons of disk space. However, the best means of accomplishing that is unclear, as is whether the DLL can or should be embedded into the NuGet package.

Note also that using the name ICU4N.Resources.dll for the assembly name is not possible because .Resources.dll is reserved for another purpose.
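
If the data did live in one shared assembly, each framework-specific ICU4N.dll could read it through the standard embedded-resource API. The following is only a minimal sketch under that assumption: SharedDataSketch and OpenData are hypothetical names, and the data assembly and resource name would have to be supplied by whatever packaging scheme is chosen. Only Assembly.GetManifestResourceStream is an existing .NET API here.

```csharp
using System;
using System.IO;
using System.Reflection;

// Hypothetical helper; not part of ICU4N.
public static class SharedDataSketch
{
    // Returns the embedded resource stream from the shared data assembly,
    // or null when the assembly contains no resource with that name.
    public static Stream OpenData(Assembly dataAssembly, string resourceName)
    {
        if (dataAssembly == null) throw new ArgumentNullException(nameof(dataAssembly));
        if (resourceName == null) throw new ArgumentNullException(nameof(resourceName));
        return dataAssembly.GetManifestResourceStream(resourceName);
    }
}
```

With this shape, every target framework build would reference the same data assembly, so the data would be shipped once rather than once per framework.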

Change namespaces/project structure to match .NET

The port was originally done from ICU4J, whose namespaces were meant to match the JDK. The structure of ICU4C is completely different.

Many (or most) of the classes defined in com.ibm.icu.text would likely be in System.Globalization if they were in .NET. For example, the Collator class is a close match for System.Globalization.CompareInfo, so it should be moved to a Globalization namespace as well.

CompareInfo is also a property of CultureInfo, so we should probably look at moving the collator functionality (or at least part of it) into the main library.

Other classes, such as BreakIterator, also seem like good candidates for the Globalization namespace. More analysis is needed to determine if anything really belongs in the Text namespace.

Keeping in Sync with ICU

Ideally, we would keep the project structure exactly the same as ICU4J so it is easy to port the diff between tags over to ICU4N file by file. However, if we take the approach the ICU team is using, there are simply headers in each file indicating which file(s) it is a port of, so syncing is made easier. More thought needs to be given to how best to do this, as ICU4N should not be a complete line-by-line port of ICU4J because of the gaps in functionality between Java and .NET.

BreakIterator.GetCharacterInstance() - results differ from ICU4J

I'm getting different results for a number of tests (9 out of 919, so not too bad...). The test names relate to XPath 4.0 tests in

https://github.com/qt4cg/qt4tests/blob/master/fn/graphemes.xml

graphemes-1172
Input: "a🏿👶" (U+0061 U+1F3FF U+1F476)
Java result: 2 strings, "a🏿", "👶" (U+0061 U+1F3FF ; U+1F476)
C# result: 3 separate single-codepoint strings.

graphemes-1173
Input: "a🏿👶‍🛑" (U+0061 U+1F3FF U+1F476 U+200D U+1F6D1)
Java result: 2 strings, "a🏿", "👶‍🛑" (U+0061 U+1F3FF ; U+1F476 U+200D U+1F6D1)
C# result: 3 strings of lengths 1, 1, 3 respectively

Other failures, not specifically analysed (happy to send the results if needed):

graphemes-1180
graphemes-1181
graphemes-1182
graphemes-1183
graphemes-1184
graphemes-1185
graphemes-1189

It's possible of course that it's a Unicode version issue.
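
When comparing segmentation results between ICU4J and ICU4N, it helps to print each segment as code points, in the same notation used above. A small sketch of such a reporting helper follows; GraphemeReport and its methods are hypothetical names, and the segments themselves are assumed to come from whichever break iterator is under test.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical reporting helper; it does not perform any segmentation itself.
public static class GraphemeReport
{
    // Formats a string as space-separated code points, e.g. "U+0061 U+1F3FF".
    public static string ToCodePoints(string s)
    {
        var parts = new List<string>();
        for (int i = 0; i < s.Length; i++)
        {
            int cp;
            if (char.IsSurrogatePair(s, i))
            {
                cp = char.ConvertToUtf32(s, i);
                i++; // skip the low surrogate
            }
            else
            {
                cp = s[i]; // BMP char (or an unpaired surrogate, printed as-is)
            }
            parts.Add("U+" + cp.ToString("X4", CultureInfo.InvariantCulture));
        }
        return string.Join(" ", parts);
    }

    // Joins already-segmented strings with " ; " between segments.
    public static string Describe(IEnumerable<string> segments)
    {
        var parts = new List<string>();
        foreach (string segment in segments)
        {
            parts.Add(ToCodePoints(segment));
        }
        return string.Join(" ; ", parts);
    }
}
```

Running both implementations over the same inputs and diffing the Describe output makes it easy to see exactly where the boundaries differ.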

API: Convert flag constants to [Flags] enums, where appropriate

Much analysis must be done to ensure that the APIs contain all possible enums so all supported flags parameters can be passed, possibly by adding parameters for multiple enums.

This has already been done on the Normalizer class, but there are still other classes that use an int for flags which can accept a wide range of options. We don't necessarily need to make this any more than cosmetic: the [Flags] enum members can keep the same values as the constants that are being used now.
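
As a rough illustration of the cosmetic conversion, the sketch below keeps the numeric values of a set of int constants and mirrors them in a [Flags] enum. LegacyOptions, ParseOptions and FlagsSketch are hypothetical names invented for this example; they are not ICU4N types.

```csharp
using System;

// Hypothetical int constants in the style of the Java port.
public static class LegacyOptions
{
    public const int IgnoreCase = 1;
    public const int SkipWhiteSpace = 2;
    public const int Strict = 4;
}

// The same values expressed as a [Flags] enum.
[Flags]
public enum ParseOptions
{
    None = 0,
    IgnoreCase = 1,
    SkipWhiteSpace = 2,
    Strict = 4,
}

public static class FlagsSketch
{
    // Existing int-based implementation.
    public static bool IsIgnoreCase(int options) => (options & LegacyOptions.IgnoreCase) != 0;

    // Enum overload that simply forwards, since the values are identical.
    public static bool IsIgnoreCase(ParseOptions options) => IsIgnoreCase((int)options);
}
```

Because the values are unchanged, the enum overloads can forward to the existing logic with a cast, and callers can combine members with the | operator.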

.NET 7/.NET 8 MAUI projects don't build if NuGet package referencing ICU4N.Resources is added: alleged same target path for e.g. zh-HK\ICU4N.resources.dll

I have found that .NET 7 or .NET 8 MAUI projects stop building in VS 2022, with errors like those shown below, as soon as I add SaxonCS as a NuGet package to the project; SaxonCS (12.4) references ICU4N.Resources 60.1.0-alpha.402.

VS gives the (German) error message below repeatedly; my translation is "Resource data contains several files with the same target path".

Fehler	APPX1101	Die Nutzdaten enthalten mehrere Dateien mit dem gleichen Zielpfad "zh-HK\ICU4N.resources.dll". Quelldateien: 
C:\Users\marti\.nuget\packages\icu4n.resources\60.1.0-alpha.402\lib\netstandard1.0\zh-HK\ICU4N.resources.dll
C:\Users\marti\.nuget\packages\icu4n.resources\60.1.0-alpha.402\lib\netstandard1.0\zh-Hant-HK\ICU4N.resources.dll	MauiNet7SaxonCS12Test1	C:\Users\marti\.nuget\packages\microsoft.windowsappsdk\1.2.221209.1\buildTransitive\Microsoft.Build.Msix.Packaging.targets	1504		
Fehler	APPX1101	Die Nutzdaten enthalten mehrere Dateien mit dem gleichen Zielpfad "zh-SG\ICU4N.resources.dll". Quelldateien: 
C:\Users\marti\.nuget\packages\icu4n.resources\60.1.0-alpha.402\lib\netstandard1.0\zh-Hans-SG\ICU4N.resources.dll
C:\Users\marti\.nuget\packages\icu4n.resources\60.1.0-alpha.402\lib\netstandard1.0\zh-SG\ICU4N.resources.dll	MauiNet7SaxonCS12Test1	C:\Users\marti\.nuget\packages\microsoft.windowsappsdk\1.2.221209.1\buildTransitive\Microsoft.Build.Msix.Packaging.targets	1504		
Fehler	APPX1101	Die Nutzdaten enthalten mehrere Dateien mit dem gleichen Zielpfad "zh-MO\ICU4N.resources.dll". Quelldateien: 
C:\Users\marti\.nuget\packages\icu4n.resources\60.1.0-alpha.402\lib\netstandard1.0\zh-Hant-MO\ICU4N.resources.dll
C:\Users\marti\.nuget\packages\icu4n.resources\60.1.0-alpha.402\lib\netstandard1.0\zh-MO\ICU4N.resources.dll	MauiNet7SaxonCS12Test1	C:\Users\marti\.nuget\packages\microsoft.windowsappsdk\1.2.221209.1\buildTransitive\Microsoft.Build.Msix.Packaging.targets	1504		
Fehler	APPX1101	Die Nutzdaten enthalten mehrere Dateien mit dem gleichen Zielpfad "zh-TW\ICU4N.resources.dll". Quelldateien: 
C:\Users\marti\.nuget\packages\icu4n.resources\60.1.0-alpha.402\lib\netstandard1.0\zh-Hant-TW\ICU4N.resources.dll
C:\Users\marti\.nuget\packages\icu4n.resources\60.1.0-alpha.402\lib\netstandard1.0\zh-TW\ICU4N.resources.dll	MauiNet7SaxonCS12Test1	C:\Users\marti\.nuget\packages\microsoft.windowsappsdk\1.2.221209.1\buildTransitive\Microsoft.Build.Msix.Packaging.targets	1504

I am not sure what is causing this. Some other .NET 8/VS 2022 project types, like WPF, build and work fine with SaxonCS referencing ICU4N.Resources 60.1.0-alpha.402.

Any idea whether that is something you can fix with ICU4N.Resources or whether I need to try to file that as a bug on .NET 7/8 MAUI/VS 2022?

Task: Modify methods to use ref or out parameters instead of array parameters and return values, where sensible

Java does not support ref or out parameters, so as a workaround methods were designed to accept or return array types of specific lengths (each element representing a specific value). These methods need to be analyzed and changed to use ref or out parameters, where appropriate.

Example

        /// <summary>
        /// Parse a single non-whitespace character '<paramref name="ch"/>', optionally
        /// preceded by whitespace.
        /// </summary>
        /// <param name="id">The string to be parsed.</param>
        /// <param name="pos">INPUT-OUTPUT parameter.  On input, pos[0] is the
        /// offset of the first character to be parsed.  On output, pos[0]
        /// is the index after the last parsed character.  If the parse
        /// fails, pos[0] will be unchanged.</param>
        /// <param name="ch">The non-whitespace character to be parsed.</param>
        /// <returns>true if '<paramref name="ch"/>' is seen preceded by zero or more
        /// whitespace characters.</returns>
        public static bool ParseChar(string id, int[] pos, char ch)
        {
            int start = pos[0];
            pos[0] = PatternProps.SkipWhiteSpace(id, pos[0]);
            if (pos[0] == id.Length ||
                    id[pos[0]] != ch)
            {
                pos[0] = start;
                return false;
            }
            ++pos[0];
            return true;
        }

Can be changed to:

        /// <summary>
        /// Parse a single non-whitespace character '<paramref name="ch"/>', optionally
        /// preceded by whitespace.
        /// </summary>
        /// <param name="id">The string to be parsed.</param>
        /// <param name="pos">INPUT-OUTPUT parameter.  On input, pos is the
        /// offset of the first character to be parsed.  On output, pos
        /// is the index after the last parsed character.  If the parse
        /// fails, pos will be unchanged.</param>
        /// <param name="ch">The non-whitespace character to be parsed.</param>
        /// <returns>true if '<paramref name="ch"/>' is seen preceded by zero or more
        /// whitespace characters.</returns>
        public static bool ParseChar(string id, ref int pos, char ch)
        {
            int start = pos;
            pos = PatternProps.SkipWhiteSpace(id, pos);
            if (pos == id.Length ||
                    id[pos] != ch)
            {
                pos = start;
                return false;
            }
            ++pos;
            return true;
        }

Be sure to update the documentation appropriately to reflect the changes.

Of course, for this example to compile, all callers of ParseChar() as well as the PatternProps.SkipWhiteSpace() method will need to be modified, as well.

To find the offending methods, I have created code analyzers in the lucenenet-codeanalysis-dev Visual Studio extension. Just install the extension and filter for the following:

  1. LuceneDev1003 - Finds all methods that accept an array parameter (except for char[])
  2. LuceneDev1004 - Finds all methods that return an array type (except for char[])
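
The same idea applies to methods that return a fixed-length array. The sketch below shows a before and after pair, using a hypothetical helper that reports the range of non-whitespace content in a string; OutParameterSketch and both method names are invented for illustration.

```csharp
using System;

// Hypothetical example; not an ICU4N API.
public static class OutParameterSketch
{
    // Java-style shape: returns int[2] where [0] is start and [1] is limit.
    public static int[] GetTrimmedRangeArray(string s)
    {
        GetTrimmedRange(s, out int start, out int limit);
        return new[] { start, limit };
    }

    // .NET-style shape: each value gets its own named out parameter,
    // so no array is allocated and the meaning of each value is explicit.
    public static void GetTrimmedRange(string s, out int start, out int limit)
    {
        if (s == null) throw new ArgumentNullException(nameof(s));
        start = 0;
        while (start < s.Length && char.IsWhiteSpace(s[start]))
        {
            start++;
        }
        limit = s.Length;
        while (limit > start && char.IsWhiteSpace(s[limit - 1]))
        {
            limit--;
        }
    }
}
```

A caller would then write OutParameterSketch.GetTrimmedRange(text, out int start, out int limit) instead of indexing into a returned array.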

Problem with Collator.getInstance(), related to ICU4N.resources

I'm getting an exception with Collator.getInstance(). I'm running in Rider. My project has a nuget dependency on ICU4N 60.1.0-alpha.402. Note the reference to ICU4N.resources version=60.0.0.0.

I thought I previously had this working but I might have been mistaken, because the application has a fallback path that catches the exception.

(Note, it would be great to get an ICU4N version that isn't at alpha status, since this leads to build warnings).

Michael Kay

System.TypeInitializationException: The type initializer for 'ICU4N.Globalization.CultureInfoExtensions' threw an exception.
---> System.TypeInitializationException: The type initializer for 'DotNetLocaleHelper' threw an exception.
---> System.TypeInitializationException: The type initializer for 'ICU4N.Impl.ICUData' threw an exception.
---> System.IO.FileNotFoundException: Could not load file or assembly 'ICU4N.resources, Version=60.0.0.0, Culture=neutral, PublicKeyToken=efb17c8e4f0e291b'. The system cannot find the file specified.

File name: 'ICU4N.resources, Version=60.0.0.0, Culture=neutral, PublicKeyToken=efb17c8e4f0e291b'
at System.Reflection.RuntimeAssembly.InternalLoad(AssemblyName assemblyName, StackCrawlMark& stackMark, AssemblyLoadContext assemblyLoadContext, RuntimeAssembly requestingAssembly, Boolean throwOnFileNotFound)
at System.Reflection.RuntimeAssembly.InternalGetSatelliteAssembly(CultureInfo culture, Version version, Boolean throwOnFileNotFound)
at System.Reflection.RuntimeAssembly.GetSatelliteAssembly(CultureInfo culture, Version version)
at ICU4N.Impl.ICUData..cctor()
--- End of inner exception stack trace ---
at ICU4N.Impl.ICUData.GetLocaleIDFromResourceName(String resourceName)
at ICU4N.Impl.ICUData.GetStream(Assembly loader, String resourceName, Boolean required)
at ICU4N.Impl.ICUBinary.GetData(Assembly assembly, String resourceName, String itemPath, Boolean required)
at ICU4N.Impl.ICUBinary.GetData(Assembly assembly, String resourceName, String itemPath)
at ICU4N.Impl.ICUResourceBundleReader.<>c__DisplayClass35_0.b__0(ReaderCacheKey key)
at ICU4N.Impl.SoftCache`2.<>c__DisplayClass1_1.<GetOrCreate>b__1()
at System.Lazy`1.ViaFactory(LazyThreadSafetyMode mode)
at System.Lazy`1.ExecutionAndPublication(LazyHelper executionAndPublication, Boolean useDefaultConstructor)
at System.Lazy`1.CreateValue()
at ICU4N.Impl.SoftCache`2.GetOrCreate(TKey key, Func`2 valueFactory)
at ICU4N.Impl.ICUResourceBundleReader.GetReader(String baseName, String localeID, Assembly root)
at ICU4N.Impl.ICUResourceBundle.CreateBundle(String baseName, String localeID, Assembly root)
at ICU4N.Impl.ICUResourceBundle.<>c__DisplayClass64_0.b__0(String key)
at ICU4N.Impl.SoftCache`2.<>c__DisplayClass1_1.<GetOrCreate>b__1()
at System.Lazy`1.ViaFactory(LazyThreadSafetyMode mode)
at System.Lazy`1.ExecutionAndPublication(LazyHelper executionAndPublication, Boolean useDefaultConstructor)
at System.Lazy`1.CreateValue()
at ICU4N.Impl.SoftCache`2.GetOrCreate(TKey key, Func`2 valueFactory)
at ICU4N.Impl.ICUResourceBundle.InstantiateBundle(String baseName, String localeID, String defaultID, Assembly root, OpenType openType)
at ICU4N.Impl.ICUResourceBundle.GetBundleInstance(String baseName, String localeID, String defaultID, Assembly root, OpenType openType)
at ICU4N.Impl.ICUResourceBundle.GetBundleInstance(String baseName, String localeID, Assembly root, OpenType openType)
at ICU4N.Impl.ICUResourceBundle.GetBundleInstance(String baseName, String localeID, Assembly root, Boolean disableFallback)
at ICU4N.Util.UResourceBundle.<>c__DisplayClass22_0.b__0(String key)
at System.Collections.Concurrent.ConcurrentDictionary`2.GetOrAdd(TKey key, Func`2 valueFactory)
at ICU4N.Util.UResourceBundle.GetRootType(String baseName, Assembly root)
at ICU4N.Util.UResourceBundle.InstantiateBundle(String baseName, String localeName, Assembly root, Boolean disableFallback)
at ICU4N.Impl.ICUResourceBundle.CreateUCultureList(String baseName, Assembly root)
at ICU4N.Impl.ICUResourceBundle.AvailEntry.<>c__DisplayClass7_0.b__0()
at System.Threading.LazyInitializer.EnsureInitializedCore[T](T& target, Func`1 valueFactory)
at ICU4N.Impl.ICUResourceBundle.AvailEntry.GetUCultureList(UCultureTypes types)
at ICU4N.Impl.ICUResourceBundle.GetUCultures(String baseName, Assembly assembly, UCultureTypes types)
at ICU4N.Impl.ICUResourceBundle.GetUCultures(UCultureTypes types)
at ICU4N.Globalization.UCultureInfo.GetCultures(UCultureTypes types)
at ICU4N.Globalization.UCultureInfo.DotNetLocaleHelper.LoadNonGregorianDefaultCalendars()
at System.Threading.LazyInitializer.EnsureInitializedCore[T](T& target, Boolean& initialized, Object& syncLock, Func`1 valueFactory)
at ICU4N.Globalization.UCultureInfo.DotNetLocaleHelper.EnsureInitialized()
at ICU4N.Globalization.UCultureInfo.DotNetLocaleHelper..cctor()
--- End of inner exception stack trace ---
at ICU4N.Globalization.UCultureInfo.DotNetLocaleHelper.EnsureInitialized()
at ICU4N.Globalization.CultureInfoExtensions..cctor()
--- End of inner exception stack trace ---
at ICU4N.Globalization.CultureInfoExtensions.ToUCultureInfo(CultureInfo culture)
at ICU4N.Globalization.UCultureInfo.GetCurrentCulture()
at ICU4N.Globalization.UCultureInfo.get_CurrentCulture()
at ICU4N.Text.Collator.GetInstance()
at Saxon.Eej.expr.sort.UcaCollatorUsingIcu..ctor(String uri) in /Users/mike/GitHub/saxon13/build/cs/ee/Saxon/Eej/expr/sort/UcaCollatorUsingIcu.cs:line 22

Task: Change suffix of files generated by T4 templates from XXXExtension.cs to XXX.generated.cs

Unfortunately, the <Class Name>Extension.cs convention we chose for file naming conflicts with some existing ICU class names and is very easily confused with the <Class Name>Extensions.cs common convention for extension methods.

The .generated.cs suffix is in use by several projects and removes ambiguity about how the code files are derived. It can also help prevent attempts to "fix" the generated code file instead of the T4 template; the generated file will be overwritten the next time the template is run.

It would be best to wait until #40 is done before working on this task.

State of Port

What is the state of this ICU4J port and is a release scheduled?

Task: Update inline StringBuilder calls to use ValueStringBuilder, when supported

Several methods that are intended to be called by other methods (often in a tight loop) new up a StringBuilder at the beginning of the method and then return the string at the end. A StringBuilder instance is allocated on the heap, and so are the underlying array chunks that it manages, so doing this inline when processing strings causes excessive garbage collection. A better alternative is to use ValueStringBuilder with a stack-allocated initial buffer (usually 32 chars will suffice). ValueStringBuilder is a ref struct, so it lives on the stack. Replacing StringBuilder with it means the entire operation occurs on the stack unless it runs out of buffer, in which case it automatically moves the chars over to a heap array rented from the array pool (so the array can be reused across method calls).

Example

        public static string Escape(string s)
        {
            StringBuilder buf = new StringBuilder();
            for (int i = 0; i < s.Length;)
            {
                int c = Character.CodePointAt(s, i);
                i += UTF16.GetCharCount(c);
                if (c >= ' ' && c <= 0x007F)
                {
                    if (c == '\\')
                    {
                        buf.Append("\\\\"); // That is, "\\"
                    }
                    else
                    {
                        buf.Append((char)c);
                    }
                }
                else
                {
                    bool four = c <= 0xFFFF;
                    buf.Append(four ? "\\u" : "\\U");
                    buf.Append(Hex(c, four ? 4 : 8));
                }
            }
            return buf.ToString();
        }

Can be changed to:

        public static string Escape(string s)
        {
#if FEATURE_SPAN
            ValueStringBuilder buf = new ValueStringBuilder(stackalloc char[CharStackBufferSize]);
#else
            StringBuilder buf = new StringBuilder(s.Length);
#endif
            for (int i = 0; i < s.Length;)
            {
                int c = Character.CodePointAt(s, i);
                i += UTF16.GetCharCount(c);
                if (c >= ' ' && c <= 0x007F)
                {
                    if (c == '\\')
                    {
                        buf.Append("\\\\"); // That is, "\\"
                    }
                    else
                    {
                        buf.Append((char)c);
                    }
                }
                else
                {
                    bool four = c <= 0xFFFF;
                    buf.Append(four ? "\\u" : "\\U");
                    buf.Append(Hex(c, four ? 4 : 8));
                }
            }
            return buf.ToString();
        }

Note that ValueStringBuilder is disposable, but calling ToString() disposes it automatically. However, it is better to use .AsSpan() to keep the buffer on the stack as long as possible if the method does not return it as a string immediately. If ToString() is not called, it should be disposed by either putting it in a using block or wrapping it in a try/finally block to explicitly call Dispose().

The above code still heap allocates a string at the end (which is less than ideal), but at least we save the allocation of the StringBuilder and its underlying memory, which if done project-wide will be very impactful for performance and won't impact the API at all.

Note that we still need to conditionally compile with FEATURE_SPAN and use StringBuilder because net40 doesn't support spans. However, in this case we can pass s.Length to the constructor to pre-allocate a reasonable initial capacity on the heap (the escaped result is at least s.Length chars long).

This task depends on #54 and may also require #55.
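
For readers who have not seen the pattern, here is a heavily simplified sketch of the technique described above: start in a stack-allocated span and move to an array rented from the pool only when the buffer fills up. MiniValueStringBuilder and EscapeSketch are hypothetical stand-ins written for this illustration; they are not the ValueStringBuilder type used by ICU4N, and they omit most of its API.

```csharp
using System;
using System.Buffers;

// Hypothetical, simplified stand-in for ValueStringBuilder.
public ref struct MiniValueStringBuilder
{
    private char[] rented;    // non-null only after growing past the initial buffer
    private Span<char> chars; // current storage: the stack buffer or the rented array
    private int pos;

    public MiniValueStringBuilder(Span<char> initialBuffer)
    {
        rented = null;
        chars = initialBuffer;
        pos = 0;
    }

    public void Append(char c)
    {
        if (pos == chars.Length) Grow(1);
        chars[pos++] = c;
    }

    public void Append(string s)
    {
        if (s.Length > chars.Length - pos) Grow(s.Length);
        s.AsSpan().CopyTo(chars.Slice(pos));
        pos += s.Length;
    }

    public ReadOnlySpan<char> AsSpan() => chars.Slice(0, pos);

    // Creates the final string and releases any rented array.
    public override string ToString()
    {
        string result = chars.Slice(0, pos).ToString();
        Dispose();
        return result;
    }

    public void Dispose()
    {
        char[] toReturn = rented;
        this = default; // guard against use after dispose
        if (toReturn != null) ArrayPool<char>.Shared.Return(toReturn);
    }

    private void Grow(int additional)
    {
        char[] bigger = ArrayPool<char>.Shared.Rent(Math.Max(pos + additional, chars.Length * 2));
        chars.Slice(0, pos).CopyTo(bigger);
        char[] toReturn = rented;
        chars = rented = bigger;
        if (toReturn != null) ArrayPool<char>.Shared.Return(toReturn);
    }
}

public static class EscapeSketch
{
    // ToString() is called, so no explicit Dispose() is needed.
    public static string EscapeBackslashes(string s)
    {
        var buf = new MiniValueStringBuilder(stackalloc char[32]);
        foreach (char c in s)
        {
            if (c == '\\') buf.Append("\\\\");
            else buf.Append(c);
        }
        return buf.ToString();
    }

    // ToString() is never called, so Dispose() runs in a finally block.
    public static int EscapedLength(string s)
    {
        var buf = new MiniValueStringBuilder(stackalloc char[32]);
        try
        {
            foreach (char c in s)
            {
                if (c == '\\') buf.Append("\\\\");
                else buf.Append(c);
            }
            return buf.AsSpan().Length;
        }
        finally
        {
            buf.Dispose();
        }
    }
}
```

The two methods in EscapeSketch mirror the two disposal rules noted above: one relies on ToString() to release the buffer, the other reads the result through AsSpan() and disposes explicitly.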

Rename classes/interfaces/enums to conform with .NET conventions

A few violations:

  1. Class names with acronyms should use Pascal casing for the acronym (e.g. IcuService instead of ICUService).
  2. Abbreviations (rather than acronyms) should be eliminated (e.g. Properties rather than Props).
  3. Need clarification on why some types are prefixed with U. Ideally, these would not be prefixed this way as well.

Unable to run Transliterator with `DOTNET_SYSTEM_GLOBALIZATION_INVARIANT="1"`

System.InvalidOperationException: Failed to compare two elements in the array.
 ---> System.TypeInitializationException: The type initializer for 'ICU4N.Text.Transliterator' threw an exception.
 ---> System.TypeInitializationException: The type initializer for 'ICU4N.Globalization.UCultureInfo' threw an exception.
 ---> System.Globalization.CultureNotFoundException: Only the invariant culture is supported in globalization-invariant mode. See https://aka.ms/GlobalizationInvariantMode for more information. (Parameter 'name')
en is an invalid culture identifier.
   at System.Globalization.CultureInfo..ctor(String name, Boolean useUserOverride)
   at ICU4N.Globalization.UCultureInfo..cctor()
   --- End of inner exception stack trace ---
   at ICU4N.Globalization.UCultureInfo.get_CurrentCulture()
   at ICU4N.Impl.ICUResourceBundle.GetBundleInstance(String baseName, String localeID, Assembly root, OpenType openType)
   at ICU4N.Impl.ICUResourceBundle.GetBundleInstance(String baseName, String localeID, Assembly root, Boolean disableFallback)
   at ICU4N.Util.UResourceBundle.<>c__DisplayClass25_0.<GetRootType>b__0(String key)
   at System.Collections.Concurrent.ConcurrentDictionary`2.GetOrAdd(TKey key, Func`2 valueFactory)
   at ICU4N.Util.UResourceBundle.GetRootType(String baseName, Assembly root)
   at ICU4N.Util.UResourceBundle.InstantiateBundle(String baseName, String localeName, Assembly root, Boolean disableFallback)
   at ICU4N.Util.UResourceBundle.GetBundleInstance(String baseName, String localeName, Assembly root, Boolean disableFallback)
   at ICU4N.Util.UResourceBundle.GetBundleInstance(String baseName, String localeName, Assembly root)
   at ICU4N.Text.Transliterator..cctor()
   --- End of inner exception stack trace ---
   at ICU4N.Text.Transliterator.GetInstance(String id)

To my understanding, the transliterator should use its own culture data and run with Invariant mode?

Or am I wrong and this has to run with the culture data installed?
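
The trace shows new CultureInfo("en") throwing from a static constructor. Purely as an illustration of how a lookup could tolerate invariant mode (this is not the project's actual code or fix), a helper could fall back to the invariant culture when a named culture is unavailable. CultureFallbackSketch and GetCultureOrInvariant are hypothetical names.

```csharp
using System;
using System.Globalization;

// Hypothetical helper; not part of ICU4N.
public static class CultureFallbackSketch
{
    // Returns the named culture, or the invariant culture when the runtime
    // cannot provide it (for example in globalization-invariant mode).
    public static CultureInfo GetCultureOrInvariant(string name)
    {
        if (name == null) throw new ArgumentNullException(nameof(name));
        try
        {
            return new CultureInfo(name);
        }
        catch (CultureNotFoundException)
        {
            return CultureInfo.InvariantCulture;
        }
    }
}
```

Whether such a fallback is appropriate depends on how much of the library relies on .NET culture data rather than its own, which is exactly the question raised here.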

Failing Test: ICU4N.Dev.Test.Collate.CollationServiceTest::TestRegisterFactory()

ICU4N.Dev.Test.Collate.CollationServiceTest::TestRegisterFactory() is failing on Linux due to the Collator.GetDisplayName(UCultureInfo) method not returning the correct name for a custom UCultureInfo instance registered with Collator.GetDisplayName().

Impact

The test is known to fail on Ubuntu 18.04 and Ubuntu 20.04 and only on .NET 5 and higher. While this makes it highly likely the issue is related to the ICU integration in .NET, changing to NLS seems to have no effect on the problem.

More investigation is needed, but due to the fact this is only returning display text for a culture it doesn't seem to be a blocker for releases at this point.

Complete UChar implementation

In Java, the UCharacter class has all of the same APIs as the java.lang.Character class. Therefore, in .NET, we need UChar to cover the same APIs (static methods) as System.Char.

We need more clarity on specifically how to design the culture-sensitivity part of this, since in Java these methods were designed to use the equivalent of CultureInfo.InvariantCulture and therefore are not aware of the ambient culture.

We also need tests to validate expected behaviors with the new methods.

Convert CharacterIterator into ICharacterEnumerator and move to J2N

CharacterIterator and classes that depend on it are the only classes in ICU4N.Support that are still marked as public. Ideally, when ICU4N is released, there will be no public facing ICU4N.Support namespace.

The CharacterIterator class should be converted into an ICharacterEnumerator interface and moved to J2N, and all implementations converted also.

This has been attempted and is partially completed, but due to the fact that CharacterIterator uses post-increment behavior, it doesn't work very well. It was fairly easy to get J2N.Text.StringCharacterEnumerator working with ICU4N, but not so for Lucene.NET.

The approach taken was to make a wrapper class so the ICharacterEnumerator could be passed in, and behind the scenes it would be wrapped by a class that implements CharacterIterator (which was made internal). This didn't work in the opposite direction when converting the rest of the CharacterIterator classes into ICharacterEnumerator instances that can be passed to implementations of the BreakIterator abstract class. I am sure it is possible, but more effort is required to work out how to make it behave correctly (being that iterators return a value, and enumerators return true/false and then a property must be read but in the case of CharacterIterator, the property must somehow be read before the call to MoveNext() or MovePrevious()).
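
To make the mismatch concrete, the sketch below wraps a forward-only enumerator in an iterator-style API by "priming" it in the constructor, so that Current is readable before the first call to Next(). All type names here are hypothetical and much smaller than the real CharacterIterator or ICharacterEnumerator; backward movement, which is where the real difficulty lies, is deliberately left out.

```csharp
using System;

// Hypothetical enumerator shape: MoveNext() returns true/false, then Current is read.
public interface ISimpleCharEnumerator
{
    char Current { get; }
    bool MoveNext();
}

public sealed class StringCharEnumerator : ISimpleCharEnumerator
{
    private readonly string text;
    private int index = -1; // positioned before the first char, as enumerators are

    public StringCharEnumerator(string text)
    {
        this.text = text ?? throw new ArgumentNullException(nameof(text));
    }

    public char Current => text[index];

    public bool MoveNext()
    {
        if (index + 1 >= text.Length)
        {
            index = text.Length;
            return false;
        }
        index++;
        return true;
    }
}

// Iterator-style wrapper: Next() returns the char itself, or Done at the end.
public sealed class IteratorAdapter
{
    public const char Done = '\uFFFF';

    private readonly ISimpleCharEnumerator inner;
    private bool hasCurrent;

    public IteratorAdapter(ISimpleCharEnumerator inner)
    {
        this.inner = inner ?? throw new ArgumentNullException(nameof(inner));
        hasCurrent = inner.MoveNext(); // prime, so Current is valid before Next()
    }

    public char Current => hasCurrent ? inner.Current : Done;

    public char Next()
    {
        if (hasCurrent) hasCurrent = inner.MoveNext();
        return Current;
    }
}
```

Priming works in the forward direction because the wrapper controls the first MoveNext() call; going the other way, or supporting MovePrevious(), requires tracking more state than this sketch does.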

Complete UCultureInfo Implementation

In #3, a UCultureInfo type was created to replace the ICU4J ULocale type. The main goals for doing this are:

  1. To provide a similar API as the CultureInfo class
  2. To provide a way to set the culture/default culture of the current thread
  3. To fill gaps in behavior between ICU and the .NET platform

We managed to get a prototype in place, but there are still some remaining tasks and research to complete.

Known Issues

  • TwoLetterISOLanguageName/GetTwoLetterISOLanguageName returns null for invariant culture, which is different than the documented behavior of ICU4J (at least for the 3 letter language), which is to return empty string. In .NET, the behavior is to return the string "iv".
  • The Parent property behaves differently than in the .NET platform. In ICU, the script tag is not considered as part of the fallback behavior, but in .NET it is (i.e. uz_Cyrl_UZ falls back to uz_Cyrl in .NET, but in ICU it falls back to uz). The AcceptLanguage overloads of UCultureInfo depend on the current behavior for the tests to pass. For now, Parent has been marked internal until this can be addressed.
  • The default behavior of CurrentCulture and CurrentUICulture when not explicitly set is to track the properties of the CultureInfo.CurrentCulture and CultureInfo.CurrentUICulture. So, when either of the latter changes, the former are automatically updated to the nearest corresponding culture. If set explicitly, this tracking stops and the culture that is set is used instead. However, once set there is currently no way to "unset" UCultureInfo.CurrentCulture or UCultureInfo.CurrentUICulture to get back to the original tracking behavior.
  • The ULocale class in ICU4J was immutable and its Clone method simply returned itself. However, in .NET, CultureInfo was designed to be mutable unless it is wrapped using the ReadOnly method. For now, UCultureInfo is immutable and marked sealed so the behavior cannot be changed.
  • Since LCID is not available in the CLDR data and is deprecated, the property has been commented out and none of the method overloads or constructors that utilize it have been added.
  • The Name property effectively returns the base name in ICU4J. They are similar, however, in .NET they are typically delimited by hyphen and in ICU they are delimited by underscore. Through some limited testing, it appears both UCultureInfo and CultureInfo accept either format. More research is required to determine whether changing to the .NET convention makes sense.
  • Lack of caching. In ICU4J, there are static properties for the most commonly used cultures (to match the JDK). In .NET, there are no such properties, but there are methods available to provide read-only cached cultures.

Missing Members of CultureInfo

The following members of CultureInfo are not yet present in UCultureInfo.

Properties

Methods

Missing Members of ULocale

Properties

  • CharacterOrientation - Marked internal, since the logical place to put it would be on TextInfo
  • LineOrientation - Marked internal, since the logical place to put it would be on TextInfo
  • IsRightToLeft - Marked internal, since the logical place to put it would be on TextInfo

API Documentation

  • Some of the JavaDocs have yet to be converted and other members (including the main class header) have not yet been documented.

UCultureInfoBuilder

  • UCultureInfoBuilder

ICU4J's ULocale class has a nested Builder class that has been de-nested and marked internal. Its purpose is to safely build a locale object while validating the inputs, as the UCultureInfo class provides no such validation. The CultureInfo class also doesn't provide validation upon creation, but will throw an exception if the requested culture doesn't exist on the platform.

More research is needed to determine if a similar function exists on the .NET platform so we can correctly map this functionality (which does exist on the Java platform) or whether it makes sense to keep it as is and add it to the public API.

Poor error message when version is incorrect

Calling VersionInfo.getInstance("unknown") produces the error message "Invalid version number: Version number may be negative or greater than 255". I suspect "may not" was intended (or perhaps "may" should be "might"?), but that doesn't exactly capture the error either...

Task: Add ReadOnlySpan<char> as a char sequence type to the T4 templates

We will need the ReadOnlySpan<char> overloads to optimize utility methods and avoid Substring() calls, which are extremely slow in .NET compared to Java. This requires updating the CodeGenerationSettings.xml file to include them and editing the T4 templates to correctly generate the overloads.

  • Methods that contain parameters for startIndex, endIndex, limit, length should omit these parameters and rely on the Slice() method of ReadOnlySpan<char> instead.

It would be best to wait until #40 is done before working on this task.
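
As a sketch of the overload shape described above, the example below pairs a string overload that takes startIndex and length with a ReadOnlySpan<char> overload that takes neither. SpanOverloadSketch and CountWhiteSpace are hypothetical names chosen for illustration; the real overloads would be produced by the T4 templates.

```csharp
using System;

// Hypothetical example; not generated ICU4N code.
public static class SpanOverloadSketch
{
    // String shape: the range is described by extra parameters.
    public static int CountWhiteSpace(string text, int startIndex, int length)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        // AsSpan validates the range and avoids allocating a substring.
        return CountWhiteSpace(text.AsSpan(startIndex, length));
    }

    // Span shape: no index parameters; callers slice before calling.
    public static int CountWhiteSpace(ReadOnlySpan<char> text)
    {
        int count = 0;
        foreach (char c in text)
        {
            if (char.IsWhiteSpace(c)) count++;
        }
        return count;
    }
}
```

A caller holding a span would write SpanOverloadSketch.CountWhiteSpace(span.Slice(start, length)), so the span overload never needs range parameters of its own.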

Finish MessageFormat implementation

The MessageFormat class is only partially implemented. It was actually only ported for the use of ChoiceFormat, which is required by Transliterator to load resources. But it currently doesn't work much beyond that purpose.

MessageFormat has many dependencies, including DateFormat and RuleBasedNumberFormat that would also need to be ported in order to make it complete.

Ideally, these format classes should implement ICustomFormatter and IFormatProvider to be compatible with string.Format() and other .NET APIs. Also, they should ideally not utilize CultureInfo as a property themselves, but be passed a CultureInfo instance when they do their work. That might not be feasible in all cases (such as rule-based number format). More analysis is required to work out the best approach for .NET compatibility.
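
For reference, this is roughly what the ICustomFormatter/IFormatProvider pairing looks like when consumed by string.Format(). The formatter below is a toy written for this illustration (CodePointFormatter and its "U" format specifier are hypothetical); it says nothing about how MessageFormat itself should be structured.

```csharp
using System;
using System.Globalization;

// Hypothetical toy formatter: renders ints as "U+XXXX" for the "U" format specifier.
public sealed class CodePointFormatter : IFormatProvider, ICustomFormatter
{
    // string.Format() asks the provider for an ICustomFormatter.
    public object GetFormat(Type formatType)
    {
        return formatType == typeof(ICustomFormatter) ? this : null;
    }

    public string Format(string format, object arg, IFormatProvider formatProvider)
    {
        if (format == "U" && arg is int codePoint)
        {
            return "U+" + codePoint.ToString("X4", CultureInfo.InvariantCulture);
        }
        // Anything else falls back to the argument's own formatting.
        if (arg is IFormattable formattable)
        {
            return formattable.ToString(format, CultureInfo.InvariantCulture);
        }
        return arg?.ToString() ?? string.Empty;
    }
}
```

With this in place, string.Format(new CodePointFormatter(), "{0:U}", 0x1F476) routes the argument through Format(), which is the hook the ICU format classes would need to expose to interoperate with .NET formatting APIs.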

Build: Add automation for building custom resource distributions

We have basic support for being able to exclude resource distributions, but it is mostly manual. We need some more automation for this to go smoothly.

  1. Make the ICU4JResourceConverter tool into a dotnet tool. This is blocked by the fact we depend on the .NET Framework version of al.exe that isn't cross OS.
  2. Automate the process of unpacking the icu4j.jar file from Maven, giving the user a choice between packing satellite assemblies or resource files.
  3. Add better tooling for being able to deploy a class library that depends on a custom resource package.
    a. We need to ensure the version of the custom NuGet package stays aligned with the original ICU4N build.
    b. The end consumer should be able to override the custom resource NuGet package with either the default or another custom resource package. This can be done by packing a buildTransitive/<assemblyName>.targets file that has a PackageReference to the custom NuGet package.

Copying the ICU4N.targets file worked pretty well. We just need to automate the process of packing a copy of this file as buildTransitive/<assemblyName>.targets in the appropriate scenarios (giving the user the ability to opt in or out of this) when a custom distribution is involved and to reference the custom package name.

Task: Add ValueStringBuilder as an appendable type to the T4 templates

We will need the ValueStringBuilder overloads to optimize utility methods and avoid heap allocations in the main business logic. This requires updating the CodeGenerationSettings.xml file to include them and editing the T4 templates to correctly generate the overloads.

Note that ValueStringBuilder is a ref struct, so when passed into methods as a parameter, it must be passed as a ref argument. It is also declared internal, so all methods that are generated with it must also be internal.

Example

        public override IAppendable Normalize(string src, IAppendable dest)
        {
            try
            {
                return dest.Append(src);
            }
            catch (IOException e)
            {
                throw new ICUUncheckedIOException(e);  // Avoid declaring "throws IOException".
            }
        }

For the above method, an overload can be generated as:

        internal override void Normalize(string src, ref ValueStringBuilder dest)
        {
            try
            {
                dest.Append(src);
            }
            catch (IOException e)
            {
                throw new ICUUncheckedIOException(e);  // Avoid declaring "throws IOException".
            }
        }

It would be best to wait until #40 is done before working on this task.

Remove ByteBuffer from public API

A JDK-style ByteBuffer class was ported from Apache Harmony to make porting the code from Java quicker. However, this class ended up in several public APIs.

Eventually, we should factor out ByteBuffer and create a set of extension methods on byte[] and other numeric types to make the same low-level conversions that ByteBuffer does. ByteBuffer should not be exposed on the public API; we should use byte[] instead when passing the data around.
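
One possible shape for such extension methods is sketched below. The method names are hypothetical, and the sketch assumes System.Buffers.Binary.BinaryPrimitives is available (on older targets this comes from the System.Memory package). Big-endian is used because that is the default byte order of java.nio.ByteBuffer.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical helper names; not part of ICU4N.
public static class ByteArrayExtensions
{
    // Reads a big-endian 32-bit integer starting at the given offset.
    public static int ReadInt32BigEndian(this byte[] buffer, int offset)
    {
        if (buffer is null) throw new ArgumentNullException(nameof(buffer));
        // AsSpan throws if the 4-byte range does not fit inside the array.
        return BinaryPrimitives.ReadInt32BigEndian(buffer.AsSpan(offset, sizeof(int)));
    }

    // Writes a big-endian 32-bit integer starting at the given offset.
    public static void WriteInt32BigEndian(this byte[] buffer, int offset, int value)
    {
        if (buffer is null) throw new ArgumentNullException(nameof(buffer));
        BinaryPrimitives.WriteInt32BigEndian(buffer.AsSpan(offset, sizeof(int)), value);
    }
}
```

For example, new byte[] { 0x00, 0x00, 0x01, 0x02 }.ReadInt32BigEndian(0) returns 258.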

Task: Add cross-OS command-line build script

Currently all building and testing is done on Azure DevOps. However, building and testing on the command line is undocumented.

While it is possible to build using dotnet build, dotnet pack, dotnet test and other commands, it would be simpler for potential contributors if there were a wrapper script to launch these commands.

The technology used for the build script is not that important as long as it runs cross-OS, but a way to get this done without adding any additional technology would simply be to make the script in MSBuild.

The build-pack-and-publish-libraries.yml file can be used as a template for the tasks in the build script.

However it is done, the README.md page should be updated with documentation on how to build/test from the command line.

Implement subclass of CultureInfo (UCultureInfo) to replace ULocale

This issue is blocking ICU4N from going to beta.

In Java, the ULocale class was created as a separate entity in ICU4J because:

  1. The Java Locale class is sealed/final
  2. In Java, the Locale is stored in a static field, not associated with the current thread like CultureInfo is in .NET

To make ICU4N compatible with .NET, a subclass of CultureInfo should be created (named UCultureInfo, for consistency) to replace ULocale.

This subclass should have all of the functionality of ULocale, but subclassing CultureInfo makes the class automatically compatible with the properties that allow it to be set to the current thread.

Compatibility

Gaps

There are several gaps in functionality between Java and .NET that need to be addressed, including:

  1. Conversion nn-NO-NY to/from nn-NO
  2. NumberFormat differences (for example, in Java there is min and max number of decimals, but in .NET there is a single property indicating the number of decimals).

There may be others; more analysis is needed to get the complete list. Perhaps a conversation with the ICU team and/or Microsoft is required to work out how to account for these gaps.
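
To make the NumberFormat gap concrete, the sketch below contrasts the single NumberFormatInfo.NumberDecimalDigits property with a custom format string that emulates a minimum/maximum pair. The helper names are hypothetical, and this is only one possible workaround, not a decision on how ICU4N should bridge the gap.

```csharp
using System.Globalization;

// Hypothetical helpers for illustration; not part of ICU4N.
public static class DecimalDigitsExample
{
    // Standard .NET formats use a single digit count from NumberFormatInfo.
    public static string FixedDigits(double value, int digits)
    {
        // InvariantCulture's NumberFormat is read-only, so work on a clone.
        var format = (NumberFormatInfo)CultureInfo.InvariantCulture.NumberFormat.Clone();
        format.NumberDecimalDigits = digits;
        return value.ToString("N", format);
    }

    // A custom format string can emulate a min/max pair:
    // '0' is a required digit and '#' is an optional digit.
    // Assumes maxDigits >= minDigits >= 1.
    public static string MinMaxDigits(double value, int minDigits, int maxDigits)
    {
        string pattern = "0." + new string('0', minDigits) + new string('#', maxDigits - minDigits);
        return value.ToString(pattern, CultureInfo.InvariantCulture);
    }
}
```

Here FixedDigits(1.5, 2) returns "1.50", while MinMaxDigits(1.5, 1, 3) returns "1.5" and MinMaxDigits(1.23456, 1, 3) returns "1.235".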

Constructor

The constructor of UCultureInfo needs to accept both standard .NET cultures (in which case, we will just wrap the default culture) and ICU-style parameters. A decision needs to be made whether to keep the ICU-style underscores (en_US@collation=phonebook) or to make it more like the .NET style (en-US@collation=phonebook). In the JDK, there are no underscores; the values are passed as separate parameters to the constructor of Locale, so there is no definite precedent to follow.

We may need to take the ICU data into consideration when making this decision.

Calendars

.NET actually has more calendars than ICU, so it is unclear exactly how to deal with this just yet. ICU is meant to be an "up to date" version of Unicode that is more recent than .NET, but we need to check whether the calendars in .NET are being kept up with the standard.

Conversion

Unlike in Java, we won't need to convert UCultureInfo to CultureInfo: because it is a subclass, a UCultureInfo already is a CultureInfo.

To convert the other way, my thought is that we should use the same pattern as was done on the System.Type class in .NET.

var name = this.GetType().GetTypeInfo().Name;

So, we could do a similar thing with CultureInfo by creating an extension method:

var ucultureInfo = CultureInfo.CurrentCulture.GetUCultureInfo();

This would allow us to add additional properties to the subclass for use by ICU4N and/or its users.
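
A minimal sketch of how such an extension method could be written follows. The UCultureInfo class here is only a placeholder with a single constructor; the real type would carry the ULocale functionality, and the conversion of ICU-specific keywords is not addressed.

```csharp
using System;
using System.Globalization;

// Placeholder for the proposed subclass; members beyond the constructor are omitted.
public class UCultureInfo : CultureInfo
{
    public UCultureInfo(string name) : base(name) { }
}

public static class CultureInfoExtensions
{
    public static UCultureInfo GetUCultureInfo(this CultureInfo culture)
    {
        if (culture is null) throw new ArgumentNullException(nameof(culture));
        // If the instance is already the subclass, return it unchanged;
        // otherwise wrap the .NET culture by name.
        return culture as UCultureInfo ?? new UCultureInfo(culture.Name);
    }
}
```

Because UCultureInfo derives from CultureInfo, an instance could also be assigned directly to CultureInfo.CurrentCulture.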

Miscellaneous

There are probably additional things that need to be considered. More analysis is required.

API: [Obsolete] members

Much of the public API is marked [Obsolete]. It is unclear why these features are not removed upon some major release, but kept in and maintained indefinitely by the ICU team. We should probably contact them to get some clarity on why this was done, and what their decision would be on accessibility of those classes/members if they were building ICU today.

Most likely, the right choice for us is to make these members internal.

Create extension methods for common BreakIterator operations

While BreakIterator provides great low-level functionality for iterating forward and backward through breaks, it would be great if there were a simple way to do forward-only operations on string, StringBuilder, and char[].

IEnumerable<int> wordBreaks = theString.ToWordBreaks();
foreach (var wordBreak in wordBreaks)
{
    // consume
}

Or

IEnumerable<int> sentenceBreaks = theString.ToSentenceBreaks(new CultureInfo("th"));
foreach (var sentenceBreak in sentenceBreaks)
{
    // consume
}

We would ideally create a different extension method (with overloads for optional culture) for all 4 modes:

  1. Word
  2. Sentence
  3. Line
  4. Character

We could then expand on this to do a higher level operation, such as providing an IEnumerable<string> that would tokenize the text so it can be iterated with a foreach loop.

foreach (var word in theText.ToWords(new CultureInfo("th-TH")))
{
   // consume each word
}

Some thought needs to be given to thread safety, since BreakIterator requires a separate clone for each thread.

Spellout numbering

As far as I can tell, ICU4N doesn't include the spellout numbering capabilities of ICU4J.

I'm interested in assessing whether it's feasible to port this code and contribute it to the project. Having no familiarity with ICU4J internals, I wouldn't know where to start, but if you can provide any initial thoughts (perhaps you've looked at it and decided it's too hard...) then I'd appreciate any pointers.

Alternatively, rather than doing it ourselves we could sponsor the development.

Note, we are currently using ICU4N in the SaxonCS project for localised collation support.

Determine and implement a solution for loading custom data

A core feature of the ICU project is to allow the end user to override the data in many ways (3 if I recall correctly). We need to research the best way to allow users to inject the data.

Unfortunately, a lot of time passed between when this issue was first noted and when it was documented, so I don't recall all of the specific details. But the documentation is here.

This issue is blocking us from going to beta.

DOCs: Automate documentation generation

ICU has pretty good documentation already. However, many of the APIs were converted to be more .NET-like so we should probably at least have API docs.

We need to discuss with the ICU team whether it is within the realm of possibility for us to contribute ICU4N back to them to maintain. That would definitely factor into how this is done.

Convert Public Java Iterator classes to .NET Enumerators or Mark Internal

With the possible exception of BreakIterator and its subclasses, all public iterators should be converted to enumerators or (if not critical end user functionality) marked internal so they can be dealt with later.

Most of this work has already been completed, with the exception of iterators in the ICU4N.Collation assembly.

Finish UnicodeSet implementation

The following ISet<T> members are not yet implemented in UnicodeSet.

  1. IsProperSubsetOf
  2. IsProperSupersetOf
  3. IsSubsetOf
  4. SymmetricExceptWith

AFAIK, this is the only work left to do on UnicodeSet features, but we should review to ensure that is the case.

Add target for .NET 6/8 or .NET Standard 2.0/2.1

I'm using SaxonCS 12.4 which references ICU4N and ICU4N.Resources. The ICU4N.Resources NuGet package is targeting .NET Standard 1.6 which brings in a lot of out of date NuGet packages that contain security vulnerabilities and need updating. Any plans to update the NuGet package to add targets for .NET 6/8 or update the existing target to .NET Standard 2.0/2.1 so we can get rid of these old package references?
