nightowl888 / icu4n
International Components for Unicode for .NET
License: Apache License 2.0
If the data were somehow embedded in a shared assembly, we wouldn't need to have duplicates of the data inside each framework targeted binary. Instead of:
ICU4N.dll (net45 > 12MB)
ICU4N.dll (netstandard1.3 > 12MB)
ICU4N.dll (netstandard2.0 > 12MB)
We could have something like
ICU4N.dll (net45 > 1MB)
ICU4N.dll (netstandard1.3 > 1MB)
ICU4N.dll (netstandard2.0 > 1MB)
Data.dll (netstandard1.0 > 11MB)
This scales much better for new target frameworks and saves a lot of disk space. However, the best means of accomplishing this is unclear, as is whether the DLL can or should be embedded into the NuGet package.
Note also that using the name ICU4N.Resources.dll for the assembly name is not possible because .Resources.dll is reserved for another purpose.
The port was originally done from ICU4J, whose namespaces were meant to match the JDK. Looking at the structure of ICU4C, its structure is completely different.
Many (or most) of the classes defined in com.ibm.icu.text would likely be in System.Globalization if they were in .NET. For example, the Collator class is a close match for System.Globalization.CompareInfo, so it should be moved to a Globalization namespace as well.
CompareInfo is also a property of CultureInfo, so we should probably look at moving the collator functionality (or at least part of it) into the main library.
Other classes, such as BreakIterator, also seem like good candidates for the Globalization namespace. More analysis is needed to determine if anything really belongs in the Text namespace.
Ideally, we would keep the project structure exactly the same as ICU4J so it is easy to port the diff between tags over to ICU4N file by file. However, if we take the approach the ICU team is using, there are simply headers in each file indicating which file(s) it is a port of, so syncing is made easier. More thought needs to be given to how best to do this, as ICU4N should not be a complete line-by-line port of ICU4J because of the gaps in functionality between Java and .NET.
Getting UCultureInfo.CurrentCulture will throw a StackOverflowException if the current culture is any of the following: zh-CN, zh-HK, zh-MO, zh-SG, zh-TW.
// If you're not running Windows in Chinese
Thread.CurrentThread.CurrentCulture = CultureInfo.GetCultureInfo("zh-TW");
var culture = UCultureInfo.CurrentCulture;
We need this for Intellisense support and possibly later for generating API documentation.
I'm getting different results for a number of tests (9 out of 919, so not too bad...). The test names relate to the XPath 4.0 tests in https://github.com/qt4cg/qt4tests/blob/master/fn/graphemes.xml
graphemes-1172
Input: "a🏿👶" (U+0061 U+1F3FF U+1F476)
Java result: 2 strings, "a🏿", "👶" (U+0061 U+1F3FF ; U+1F476)
C# result: 3 separate single-codepoint strings.
graphemes-1173
Input: "a🏿👶‍🛑" (U+0061 U+1F3FF U+1F476 U+200D U+1F6D1)
Java result: 2 strings, "a🏿", "👶‍🛑" (U+0061 U+1F3FF ; U+1F476 U+200D U+1F6D1)
C# result: 3 strings of lengths 1, 1, 3 respectively
Other failures, not specifically analysed (happy to send the results if needed):
graphemes-1180
graphemes-1181
graphemes-1182
graphemes-1183
graphemes-1184
graphemes-1185
graphemes-1189
It's possible of course that it's a Unicode version issue.
Much analysis must be done to ensure that the APIs contain all possible enums so all supported flags parameters can be passed, possibly by adding parameters for multiple enums.
This has already been done on the Normalizer class, but there are still other classes that use an int for flags which can accept a wide range of options. We don't necessarily need to make this any more than cosmetic - the [Flags] enum values can remain the same as the constants that are being used now.
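The "cosmetic" conversion described above can be sketched roughly as follows. This is only an illustration: LegacyOptions, PatternOptions and PatternApi are hypothetical names invented for the example (not existing ICU4N types), and the numeric values are placeholders for whatever constants a given class uses today.

```csharp
using System;

// Callers combine named options instead of OR-ing raw ints.
var options = PatternOptions.IgnoreSpace | PatternOptions.Case;
Console.WriteLine(options);  // prints "IgnoreSpace, Case"
Console.WriteLine(PatternApi.HasOption(options, PatternOptions.Case));  // prints "True"

// Hypothetical stand-in for the int constants a class exposes today.
public static class LegacyOptions
{
    public const int IgnoreSpace = 1;
    public const int Case = 2;
    public const int AddCaseMappings = 4;
}

// Hypothetical [Flags] enum that reuses the exact numeric values above,
// so existing bit-twiddling logic does not need to change.
[Flags]
public enum PatternOptions
{
    None = 0,
    IgnoreSpace = LegacyOptions.IgnoreSpace,
    Case = LegacyOptions.Case,
    AddCaseMappings = LegacyOptions.AddCaseMappings,
}

public static class PatternApi
{
    // Existing int-based logic (unchanged).
    public static bool HasOption(int options, int flag) => (options & flag) != 0;

    // New enum-based overload: a thin wrapper that casts and delegates.
    public static bool HasOption(PatternOptions options, PatternOptions flag)
        => HasOption((int)options, (int)flag);
}
```

Because the enum members carry the same values as the constants, the enum overload only needs a cast, and the int-based code path can be made internal later without changing behavior.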
I have found that .NET 7 or .NET 8 MAUI projects stop building with VS 2022 with errors like shown below as soon as I add SaxonCS as a NuGet package to the project; SaxonCS (12.4) references ICU4N.Resources 60.1.0-alpha.402.
My translation of the (German) error message VS is giving below (repeatedly) is "Resource data contains several files with the same target path"
Fehler APPX1101 Die Nutzdaten enthalten mehrere Dateien mit dem gleichen Zielpfad "zh-HK\ICU4N.resources.dll". Quelldateien:
C:\Users\marti\.nuget\packages\icu4n.resources\60.1.0-alpha.402\lib\netstandard1.0\zh-HK\ICU4N.resources.dll
C:\Users\marti\.nuget\packages\icu4n.resources\60.1.0-alpha.402\lib\netstandard1.0\zh-Hant-HK\ICU4N.resources.dll MauiNet7SaxonCS12Test1 C:\Users\marti\.nuget\packages\microsoft.windowsappsdk\1.2.221209.1\buildTransitive\Microsoft.Build.Msix.Packaging.targets 1504
Fehler APPX1101 Die Nutzdaten enthalten mehrere Dateien mit dem gleichen Zielpfad "zh-SG\ICU4N.resources.dll". Quelldateien:
C:\Users\marti\.nuget\packages\icu4n.resources\60.1.0-alpha.402\lib\netstandard1.0\zh-Hans-SG\ICU4N.resources.dll
C:\Users\marti\.nuget\packages\icu4n.resources\60.1.0-alpha.402\lib\netstandard1.0\zh-SG\ICU4N.resources.dll MauiNet7SaxonCS12Test1 C:\Users\marti\.nuget\packages\microsoft.windowsappsdk\1.2.221209.1\buildTransitive\Microsoft.Build.Msix.Packaging.targets 1504
Fehler APPX1101 Die Nutzdaten enthalten mehrere Dateien mit dem gleichen Zielpfad "zh-MO\ICU4N.resources.dll". Quelldateien:
C:\Users\marti\.nuget\packages\icu4n.resources\60.1.0-alpha.402\lib\netstandard1.0\zh-Hant-MO\ICU4N.resources.dll
C:\Users\marti\.nuget\packages\icu4n.resources\60.1.0-alpha.402\lib\netstandard1.0\zh-MO\ICU4N.resources.dll MauiNet7SaxonCS12Test1 C:\Users\marti\.nuget\packages\microsoft.windowsappsdk\1.2.221209.1\buildTransitive\Microsoft.Build.Msix.Packaging.targets 1504
Fehler APPX1101 Die Nutzdaten enthalten mehrere Dateien mit dem gleichen Zielpfad "zh-TW\ICU4N.resources.dll". Quelldateien:
C:\Users\marti\.nuget\packages\icu4n.resources\60.1.0-alpha.402\lib\netstandard1.0\zh-Hant-TW\ICU4N.resources.dll
C:\Users\marti\.nuget\packages\icu4n.resources\60.1.0-alpha.402\lib\netstandard1.0\zh-TW\ICU4N.resources.dll MauiNet7SaxonCS12Test1 C:\Users\marti\.nuget\packages\microsoft.windowsappsdk\1.2.221209.1\buildTransitive\Microsoft.Build.Msix.Packaging.targets 1504
I am not sure what is causing this; some other .NET 8/VS 2022 project types, such as WPF, build and work fine with SaxonCS referencing ICU4N.Resources 60.1.0-alpha.402.
Any idea whether that is something you can fix with ICU4N.Resources or whether I need to try to file that as a bug on .NET 7/8 MAUI/VS 2022?
Similar methods were part of the ICU4J implementation. The rules for how to implement these methods for VB are documented here.
Java does not support ref or out parameters, so as a workaround methods were designed to accept or return array types of specific lengths (each element representing a specific value). These methods need to be analyzed and changed to use ref or out parameters, where appropriate.
/// <summary>
/// Parse a single non-whitespace character '<paramref name="ch"/>', optionally
/// preceded by whitespace.
/// </summary>
/// <param name="id">The string to be parsed.</param>
/// <param name="pos">INPUT-OUTPUT parameter. On input, pos[0] is the
/// offset of the first character to be parsed. On output, pos[0]
/// is the index after the last parsed character. If the parse
/// fails, pos[0] will be unchanged.</param>
/// <param name="ch">The non-whitespace character to be parsed.</param>
/// <returns>true if '<paramref name="ch"/>' is seen preceded by zero or more
/// whitespace characters.</returns>
public static bool ParseChar(string id, int[] pos, char ch)
{
    int start = pos[0];
    pos[0] = PatternProps.SkipWhiteSpace(id, pos[0]);
    if (pos[0] == id.Length ||
        id[pos[0]] != ch)
    {
        pos[0] = start;
        return false;
    }
    ++pos[0];
    return true;
}
Can be changed to:
/// <summary>
/// Parse a single non-whitespace character '<paramref name="ch"/>', optionally
/// preceded by whitespace.
/// </summary>
/// <param name="id">The string to be parsed.</param>
/// <param name="pos">INPUT-OUTPUT parameter. On input, pos is the
/// offset of the first character to be parsed. On output, pos
/// is the index after the last parsed character. If the parse
/// fails, pos will be unchanged.</param>
/// <param name="ch">The non-whitespace character to be parsed.</param>
/// <returns>true if '<paramref name="ch"/>' is seen preceded by zero or more
/// whitespace characters.</returns>
public static bool ParseChar(string id, ref int pos, char ch)
{
    int start = pos;
    pos = PatternProps.SkipWhiteSpace(id, pos);
    if (pos == id.Length ||
        id[pos] != ch)
    {
        pos = start;
        return false;
    }
    ++pos;
    return true;
}
Be sure to update the documentation appropriately to reflect the changes.
Of course, for this example to compile, all callers of ParseChar() as well as the PatternProps.SkipWhiteSpace() method will need to be modified as well.
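To show what the change means for callers, here is a small self-contained sketch of the ref-based version being called. ParseSketch and its SkipWhiteSpace are hypothetical stand-ins written for this example; in particular, SkipWhiteSpace approximates PatternProps.SkipWhiteSpace() with char.IsWhiteSpace, which is an assumption and not the real ICU pattern white space definition.

```csharp
using System;

// Caller side: a plain int local replaces the one-element int[] that Java needed.
int pos = 0;
bool found = ParseSketch.ParseChar("   ;rest", ref pos, ';');
Console.WriteLine($"{found} {pos}");  // prints "True 4"

pos = 0;
found = ParseSketch.ParseChar("  x", ref pos, ';');
Console.WriteLine($"{found} {pos}");  // prints "False 0" (pos is restored on failure)

public static class ParseSketch
{
    // Hypothetical stand-in for PatternProps.SkipWhiteSpace():
    // returns the index of the first non-whitespace char at or after i.
    public static int SkipWhiteSpace(string s, int i)
    {
        while (i < s.Length && char.IsWhiteSpace(s[i]))
            i++;
        return i;
    }

    // Same logic as the ref-based ParseChar shown above.
    public static bool ParseChar(string id, ref int pos, char ch)
    {
        int start = pos;
        pos = SkipWhiteSpace(id, pos);
        if (pos == id.Length || id[pos] != ch)
        {
            pos = start;
            return false;
        }
        ++pos;
        return true;
    }
}
```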
To find the offending methods, I have created code analyzers in the lucenenet-codeanalysis-dev Visual Studio extension. Just install the extension and filter for the following:
LuceneDev1003 - Finds all methods that accept an array parameter (except for char[])
LuceneDev1004 - Finds all methods that return an array parameter (except for char[])
I'm getting an exception with Collator.getInstance(). I'm running in Rider. My project has a nuget dependency on ICU4N 60.1.0-alpha.402. Note the reference to ICU4N.resources version=60.0.0.0.
I thought I previously had this working but I might have been mistaken, because the application has a fallback path that catches the exception.
(Note, it would be great to get an ICU4N version that isn't at alpha status, since this leads to build warnings).
Michael Kay
System.TypeInitializationException: The type initializer for 'ICU4N.Globalization.CultureInfoExtensions' threw an exception.
---> System.TypeInitializationException: The type initializer for 'DotNetLocaleHelper' threw an exception.
---> System.TypeInitializationException: The type initializer for 'ICU4N.Impl.ICUData' threw an exception.
---> System.IO.FileNotFoundException: Could not load file or assembly 'ICU4N.resources, Version=60.0.0.0, Culture=neutral, PublicKeyToken=efb17c8e4f0e291b'. The system cannot find the file specified.
File name: 'ICU4N.resources, Version=60.0.0.0, Culture=neutral, PublicKeyToken=efb17c8e4f0e291b'
at System.Reflection.RuntimeAssembly.InternalLoad(AssemblyName assemblyName, StackCrawlMark& stackMark, AssemblyLoadContext assemblyLoadContext, RuntimeAssembly requestingAssembly, Boolean throwOnFileNotFound)
at System.Reflection.RuntimeAssembly.InternalGetSatelliteAssembly(CultureInfo culture, Version version, Boolean throwOnFileNotFound)
at System.Reflection.RuntimeAssembly.GetSatelliteAssembly(CultureInfo culture, Version version)
at ICU4N.Impl.ICUData..cctor()
--- End of inner exception stack trace ---
at ICU4N.Impl.ICUData.GetLocaleIDFromResourceName(String resourceName)
at ICU4N.Impl.ICUData.GetStream(Assembly loader, String resourceName, Boolean required)
at ICU4N.Impl.ICUBinary.GetData(Assembly assembly, String resourceName, String itemPath, Boolean required)
at ICU4N.Impl.ICUBinary.GetData(Assembly assembly, String resourceName, String itemPath)
at ICU4N.Impl.ICUResourceBundleReader.<>c__DisplayClass35_0.b__0(ReaderCacheKey key)
at ICU4N.Impl.SoftCache`2.<>c__DisplayClass1_1.<GetOrCreate>b__1()
at System.Lazy`1.ViaFactory(LazyThreadSafetyMode mode)
at System.Lazy`1.ExecutionAndPublication(LazyHelper executionAndPublication, Boolean useDefaultConstructor)
at System.Lazy`1.CreateValue()
at ICU4N.Impl.SoftCache`2.GetOrCreate(TKey key, Func`2 valueFactory)
at ICU4N.Impl.ICUResourceBundleReader.GetReader(String baseName, String localeID, Assembly root)
at ICU4N.Impl.ICUResourceBundle.CreateBundle(String baseName, String localeID, Assembly root)
at ICU4N.Impl.ICUResourceBundle.<>c__DisplayClass64_0.b__0(String key)
at ICU4N.Impl.SoftCache`2.<>c__DisplayClass1_1.<GetOrCreate>b__1()
at System.Lazy`1.ViaFactory(LazyThreadSafetyMode mode)
at System.Lazy`1.ExecutionAndPublication(LazyHelper executionAndPublication, Boolean useDefaultConstructor)
at System.Lazy`1.CreateValue()
at ICU4N.Impl.SoftCache`2.GetOrCreate(TKey key, Func`2 valueFactory)
at ICU4N.Impl.ICUResourceBundle.InstantiateBundle(String baseName, String localeID, String defaultID, Assembly root, OpenType openType)
at ICU4N.Impl.ICUResourceBundle.GetBundleInstance(String baseName, String localeID, String defaultID, Assembly root, OpenType openType)
at ICU4N.Impl.ICUResourceBundle.GetBundleInstance(String baseName, String localeID, Assembly root, OpenType openType)
at ICU4N.Impl.ICUResourceBundle.GetBundleInstance(String baseName, String localeID, Assembly root, Boolean disableFallback)
at ICU4N.Util.UResourceBundle.<>c__DisplayClass22_0.b__0(String key)
at System.Collections.Concurrent.ConcurrentDictionary`2.GetOrAdd(TKey key, Func`2 valueFactory)
at ICU4N.Util.UResourceBundle.GetRootType(String baseName, Assembly root)
at ICU4N.Util.UResourceBundle.InstantiateBundle(String baseName, String localeName, Assembly root, Boolean disableFallback)
at ICU4N.Impl.ICUResourceBundle.CreateUCultureList(String baseName, Assembly root)
at ICU4N.Impl.ICUResourceBundle.AvailEntry.<>c__DisplayClass7_0.b__0()
at System.Threading.LazyInitializer.EnsureInitializedCore[T](T& target, Func`1 valueFactory)
at ICU4N.Impl.ICUResourceBundle.AvailEntry.GetUCultureList(UCultureTypes types)
at ICU4N.Impl.ICUResourceBundle.GetUCultures(String baseName, Assembly assembly, UCultureTypes types)
at ICU4N.Impl.ICUResourceBundle.GetUCultures(UCultureTypes types)
at ICU4N.Globalization.UCultureInfo.GetCultures(UCultureTypes types)
at ICU4N.Globalization.UCultureInfo.DotNetLocaleHelper.LoadNonGregorianDefaultCalendars()
at System.Threading.LazyInitializer.EnsureInitializedCore[T](T& target, Boolean& initialized, Object& syncLock, Func`1 valueFactory)
at ICU4N.Globalization.UCultureInfo.DotNetLocaleHelper.EnsureInitialized()
at ICU4N.Globalization.UCultureInfo.DotNetLocaleHelper..cctor()
--- End of inner exception stack trace ---
at ICU4N.Globalization.UCultureInfo.DotNetLocaleHelper.EnsureInitialized()
at ICU4N.Globalization.CultureInfoExtensions..cctor()
--- End of inner exception stack trace ---
at ICU4N.Globalization.CultureInfoExtensions.ToUCultureInfo(CultureInfo culture)
at ICU4N.Globalization.UCultureInfo.GetCurrentCulture()
at ICU4N.Globalization.UCultureInfo.get_CurrentCulture()
at ICU4N.Text.Collator.GetInstance()
at Saxon.Eej.expr.sort.UcaCollatorUsingIcu..ctor(String uri) in /Users/mike/GitHub/saxon13/build/cs/ee/Saxon/Eej/expr/sort/UcaCollatorUsingIcu.cs:line 22
The type initializer for 'ICU4N.Globalization.CultureInfoExtensions' threw an exception.
System.TypeInitializationException: The type initializer for 'ICU4N.Globalization.CultureInfoExtensions' threw an exception.
---> System.TypeInitializationException: The type initializer for 'DotNetLocaleHelper' threw an exception.
---> System.TypeInitializationException: The type initializer for 'ICU4N.Impl.ICUData' threw an exception.
---> System.IO.FileNotFoundException: Could not load file or assembly 'ICU4N.resources, Version=60.0.0.0, Culture=neutral, PublicKeyToken=efb17c8e4f0e291b'. The system cannot find the file specified.
In .NET, classes tend to be flattened into namespaces rather than nested like Russian dolls as they are in Java. We should try to avoid nested classes, where practical.
Unfortunately, the <Class Name>Extension.cs convention we chose for file naming conflicts with some existing ICU class names and is very easily confused with the common <Class Name>Extensions.cs convention for extension methods.
The .generated.cs suffix is in use by several projects and removes ambiguity about how the code files are derived. It can also help discourage attempts to "fix" the code file instead of the T4 template; any change to the code file will be overwritten the next time the template is run.
It would be best to wait until #40 is done before working on this task.
What is the state of this ICU4J port, and is a release scheduled?
Several methods that are intended to be called by other methods (often in a tight loop) are set up to new up a StringBuilder at the beginning of the method and then return the string at the end. A StringBuilder instance is allocated on the heap, as are the underlying array chunks that it manages, so doing this inline when processing strings causes excessive garbage collection. A better alternative is to use ValueStringBuilder with a stack-allocated initial buffer (usually 32 chars will suffice). ValueStringBuilder is a ref struct, so it lives on the stack. Using it means the entire operation occurs on the stack unless it runs out of buffer, in which case it will automatically move the chars over to a heap array (rented from the array pool, so it can be reused across method calls).
public static string Escape(string s)
{
    StringBuilder buf = new StringBuilder();
    for (int i = 0; i < s.Length;)
    {
        int c = Character.CodePointAt(s, i);
        i += UTF16.GetCharCount(c);
        if (c >= ' ' && c <= 0x007F)
        {
            if (c == '\\')
            {
                buf.Append("\\\\"); // That is, "\\"
            }
            else
            {
                buf.Append((char)c);
            }
        }
        else
        {
            bool four = c <= 0xFFFF;
            buf.Append(four ? "\\u" : "\\U");
            buf.Append(Hex(c, four ? 4 : 8));
        }
    }
    return buf.ToString();
}
Can be changed to:
public static string Escape(string s)
{
#if FEATURE_SPAN
    ValueStringBuilder buf = new ValueStringBuilder(stackalloc char[CharStackBufferSize]);
#else
    StringBuilder buf = new StringBuilder(s.Length);
#endif
    for (int i = 0; i < s.Length;)
    {
        int c = Character.CodePointAt(s, i);
        i += UTF16.GetCharCount(c);
        if (c >= ' ' && c <= 0x007F)
        {
            if (c == '\\')
            {
                buf.Append("\\\\"); // That is, "\\"
            }
            else
            {
                buf.Append((char)c);
            }
        }
        else
        {
            bool four = c <= 0xFFFF;
            buf.Append(four ? "\\u" : "\\U");
            buf.Append(Hex(c, four ? 4 : 8));
        }
    }
    return buf.ToString();
}
Note that ValueStringBuilder is disposable, but calling ToString() disposes it automatically. However, it is better to use .AsSpan() to keep the buffer on the stack as long as possible if the method does not return it as a string immediately. If ToString() is not called, it should be disposed by either putting it in a using block or wrapping it in a try/finally block to explicitly call Dispose().
The above code still heap allocates a string at the end (which is less than ideal), but at least we save the allocation of the StringBuilder and its underlying memory, which if done project-wide will be very impactful for performance and won't impact the API at all.
Note that we still need to conditionally compile with FEATURE_SPAN and use StringBuilder because net40 doesn't support spans. However, in this case we can pass s.Length to the constructor to pre-allocate the correct amount of heap.
A few violations:
IcuService instead of ICUService).
Properties rather than Props).
U. Ideally, these would not be prefixed this way as well.
System.InvalidOperationException: Failed to compare two elements in the array.
---> System.TypeInitializationException: The type initializer for 'ICU4N.Text.Transliterator' threw an exception.
---> System.TypeInitializationException: The type initializer for 'ICU4N.Globalization.UCultureInfo' threw an exception.
---> System.Globalization.CultureNotFoundException: Only the invariant culture is supported in globalization-invariant mode. See https://aka.ms/GlobalizationInvariantMode for more information. (Parameter 'name')
en is an invalid culture identifier.
at System.Globalization.CultureInfo..ctor(String name, Boolean useUserOverride)
at ICU4N.Globalization.UCultureInfo..cctor()
--- End of inner exception stack trace ---
at ICU4N.Globalization.UCultureInfo.get_CurrentCulture()
at ICU4N.Impl.ICUResourceBundle.GetBundleInstance(String baseName, String localeID, Assembly root, OpenType openType)
at ICU4N.Impl.ICUResourceBundle.GetBundleInstance(String baseName, String localeID, Assembly root, Boolean disableFallback)
at ICU4N.Util.UResourceBundle.<>c__DisplayClass25_0.<GetRootType>b__0(String key)
at System.Collections.Concurrent.ConcurrentDictionary`2.GetOrAdd(TKey key, Func`2 valueFactory)
at ICU4N.Util.UResourceBundle.GetRootType(String baseName, Assembly root)
at ICU4N.Util.UResourceBundle.InstantiateBundle(String baseName, String localeName, Assembly root, Boolean disableFallback)
at ICU4N.Util.UResourceBundle.GetBundleInstance(String baseName, String localeName, Assembly root, Boolean disableFallback)
at ICU4N.Util.UResourceBundle.GetBundleInstance(String baseName, String localeName, Assembly root)
at ICU4N.Text.Transliterator..cctor()
--- End of inner exception stack trace ---
at ICU4N.Text.Transliterator.GetInstance(String id)
To my understanding, the transliterator should use its own culture data and run with Invariant mode?
Or am I wrong and this has to run with the culture data installed?
ICU4N.Dev.Test.Collate.CollationServiceTest::TestRegisterFactory() is failing on Linux due to the Collator.GetDisplayName(UCultureInfo) method not returning the correct name for a custom UCultureInfo instance registered with Collator.GetDisplayName().
The test is known to fail on Ubuntu 18.04 and Ubuntu 20.04 and only on .NET 5 and higher. While this makes it highly likely the issue is related to the ICU integration in .NET, changing to NLS seems to have no effect on the problem.
More investigation is needed, but due to the fact this is only returning display text for a culture it doesn't seem to be a blocker for releases at this point.
The property that should be named ICU4N.Text.CollationElementIterator.Ignorable is misspelt Ingorable.
ICU4N comes with many localized files already; however, they do not come in the format that .NET expects. Some analysis needs to be done to determine how to make this happen so we don't need to embed all resources inside of the main DLLs.
We need to update our build to automatically generate the files that are template-based. Until now, we have been doing this manually.
There is an example of this here: https://github.com/libgit2/libgit2sharp/blob/0e7ec84e1a339e0215b71f85244dd06a06d82d61/LibGit2Sharp/LibGit2Sharp.csproj#L27-L28.
In Java, the UCharacter class has all of the same APIs as the java.lang.Character class. Therefore, in .NET, we need UChar to cover the same APIs (static methods) as System.Char.
We need more clarity on specifically how to design the culture-sensitivity part of this, since in Java these methods were designed to use the equivalent of CultureInfo.InvariantCulture and therefore are not aware of the ambient culture.
We also need tests to validate expected behaviors with the new methods.
CharacterIterator and classes that depend on it are the only classes in ICU4N.Support that are still marked as public. Ideally, when ICU4N is released, there will be no public facing ICU4N.Support namespace.
The CharacterIterator class should be converted into an ICharacterEnumerator interface and moved to J2N, and all implementations converted also.
This has been attempted and is partially completed, but due to the fact that CharacterIterator uses post-increment behavior, it doesn't work very well. It was fairly easy to get J2N.Text.StringCharacterEnumerator working with ICU4N, but not so for Lucene.NET.
The approach taken was to make a wrapper class so the ICharacterEnumerator could be passed in, and behind the scenes it would be wrapped by a class that implements CharacterIterator (which was made internal). This didn't work in the opposite direction when converting the rest of the CharacterIterator classes into ICharacterEnumerator instances that can be passed to implementations of the BreakIterator abstract class. I am sure it is possible, but more effort is required to work out how to make it behave correctly (being that iterators return a value, and enumerators return true/false and then a property must be read, but in the case of CharacterIterator, the property must somehow be read before the call to MoveNext() or MovePrevious()).
In #3, a UCultureInfo type was created to replace the ICU4J ULocale type. The main goals for doing this are:
CultureInfo class
We managed to get a prototype in place, but there are still some remaining tasks and research to complete.
TwoLetterISOLanguageName/GetTwoLetterISOLanguageName returns null for the invariant culture, which is different from the documented behavior of ICU4J (at least for the 3-letter language), which is to return an empty string. In .NET, the behavior is to return the string "iv".
The Parent property behaves differently than on the .NET platform. In ICU, the script tag is not considered part of the fallback behavior, but in .NET it is (i.e. uz_Cyrl_UZ falls back to uz_Cyrl in .NET, but in ICU it falls back to uz). The AcceptLanguage overloads of UCultureInfo depend on the current behavior for the tests to pass. For now, Parent has been marked internal until this can be addressed.
The behavior of CurrentCulture and CurrentUICulture when not explicitly set is to track the CultureInfo.CurrentCulture and CultureInfo.CurrentUICulture properties. So, when either of the latter changes, the former are automatically updated to the nearest corresponding culture. If set explicitly, this tracking stops and the culture that is set is used instead. However, once set there is currently no way to "unset" UCultureInfo.CurrentCulture or UCultureInfo.CurrentUICulture to get back to the original tracking behavior.
The ULocale class in ICU4J was immutable and its Clone method simply returned itself. However, in .NET, CultureInfo was designed to be mutable unless it is wrapped using the ReadOnly method. For now, UCultureInfo is immutable and marked sealed so the behavior cannot be changed.
The Name property effectively returns the base name in ICU4J. They are similar; however, in .NET they are typically delimited by hyphen and in ICU they are delimited by underscore. Through some limited testing, it appears both UCultureInfo and CultureInfo accept either format. More research is required to determine whether changing to the .NET convention makes sense.
CultureInfo
The following members of CultureInfo are not yet present in UCultureInfo.
Calendar - NOTE: ICU4J has its own calendars that are not ported and not planned for the current release
CompareInfo - This is the equivalent of the Collator class in ICU4J, however the API is more extensible in ICU4J and custom sort rules can be defined. It may not make sense to merge the two.
DateTimeFormat
IsReadOnly
KeyboardLayoutId
LCID
NumberFormat - Added in #46.
OptionalCalendars
Parent - Marked internal, since the behavior differs from .NET (see above)
TextInfo - Although ICU has properties that make sense to put here, the constructor is internal
ThreeLetterWindowsLanguageName
UseUserOverride
public static UCultureInfo CreateSpecificCulture(string name)
public UCultureInfo GetConsoleFallbackUICulture()
public static UCultureInfo GetCultureInfo(string name)
public static UCultureInfo GetCultureInfo(int culture)
public static UCultureInfo GetCultureInfo(string name, string altName) - This one seems to be a duplicate of passing the keyword (i.e. @collation=phonebook), but in ICU4N, there is currently no culture cache.
public virtual object GetFormat(Type formatType) - Added in #46.
public static UCultureInfo ReadOnly(UCultureInfo ci) - Added in #46.
ULocale
CharacterOrientation - Marked internal, since the logical place to put it would be on TextInfo
LineOrientation - Marked internal, since the logical place to put it would be on TextInfo
IsRightToLeft - Marked internal, since the logical place to put it would be on TextInfo
UCultureInfoBuilder
ICU4J's ULocale class has a nested Builder class that has been de-nested and marked internal. Its purpose is to safely build a locale object while validating the inputs, as the UCultureInfo class provides no such validation. The CultureInfo class also doesn't provide validation upon creation, but will throw an exception if the requested culture doesn't exist on the platform.
More research is needed to determine if a similar function exists on the .NET platform so we can correctly map this functionality (which does exist on the Java platform) or whether it makes sense to keep it as is and add it to the public API.
Calling VersionInfo.getInstance("unknown") produces the error message "Invalid version number: Version number may be negative or greater than 255". I suspect "may not" was intended (or perhaps "may" should be "might"?), but that doesn't exactly capture the error either...
TryGet versions of Get methods were added to UScript in order to avoid using exceptions for control flow. However, we need tests for those methods to confirm functionality behaves as expected.
For example, GetLong() should be converted to GetInt64(), GetFloat() to GetSingle(), etc.
We will need the ReadOnlySpan<char> overloads to optimize utility methods and avoid Substring() calls, which are extremely slow in .NET compared to Java. This requires updating the CodeGenerationSettings.xml file to include them and editing the T4 templates to correctly generate the overloads.
startIndex, endIndex, limit, length should omit these parameters and rely on the Slice() method of ReadOnlySpan<char> instead.
It would be best to wait until #40 is done before working on this task.
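As a rough illustration of the ReadOnlySpan<char> overload idea, the sketch below pairs a string overload that takes startIndex/length with a span overload that takes neither, leaving range selection to Slice() at the call site. SpanSketch and CountTrailingSpaces are hypothetical names made up for this example; they are not ICU4N APIs.

```csharp
using System;

string text = "abc   ";

// String overload: the range is passed as extra parameters.
Console.WriteLine(SpanSketch.CountTrailingSpaces(text, 0, 4));             // prints "1"

// Span overload: the caller slices, so no startIndex/length parameters
// are needed and no Substring() allocation happens.
Console.WriteLine(SpanSketch.CountTrailingSpaces(text.AsSpan().Slice(0, 4))); // prints "1"
Console.WriteLine(SpanSketch.CountTrailingSpaces(text.AsSpan()));          // prints "3"

public static class SpanSketch
{
    // Hypothetical existing-style overload; it simply delegates to the span version.
    public static int CountTrailingSpaces(string s, int startIndex, int length)
        => CountTrailingSpaces(s.AsSpan(startIndex, length));

    // Hypothetical generated-style overload working directly on the span.
    public static int CountTrailingSpaces(ReadOnlySpan<char> s)
    {
        int count = 0;
        for (int i = s.Length - 1; i >= 0 && s[i] == ' '; i--)
            count++;
        return count;
    }
}
```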
The MessageFormat class is only partially implemented. It was actually only ported for the use of ChoiceFormat, which is required by Transliterator to load resources. But it currently doesn't work much beyond that purpose.
MessageFormat has many dependencies, including DateFormat and RuleBasedNumberFormat, that would also need to be ported in order to make it complete.
Ideally, these format classes should implement ICustomFormatter and IFormatProvider to be compatible with string.Format() and other .NET APIs. Also, they should ideally not utilize CultureInfo as a property themselves, but be passed a CultureInfo instance when they do their work. That might not be feasible in all cases (such as rule-based number format). More analysis is required to work out the best approach for .NET compatibility.
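For readers unfamiliar with how ICustomFormatter and IFormatProvider plug into string.Format(), here is a minimal sketch. OrdinalFormatProvider and its "ord" format string are hypothetical and English-only; they merely stand in for whatever a ported format class would do, and the fallback to CultureInfo.InvariantCulture is an arbitrary choice made to keep the example deterministic.

```csharp
using System;
using System.Globalization;

// string.Format() asks the provider for an ICustomFormatter and, if it gets one,
// routes every format item through its Format() method.
Console.WriteLine(string.Format(new OrdinalFormatProvider(), "{0:ord} place, {1:N1} points", 2, 9.5));
// prints "2nd place, 9.5 points"

// Hypothetical formatter, used only to illustrate the two interfaces.
public sealed class OrdinalFormatProvider : IFormatProvider, ICustomFormatter
{
    public object GetFormat(Type formatType)
        => formatType == typeof(ICustomFormatter) ? this : null;

    public string Format(string format, object arg, IFormatProvider formatProvider)
    {
        // Handle only the custom "ord" format for ints.
        if (format == "ord" && arg is int n)
            return n.ToString(CultureInfo.InvariantCulture) + Suffix(n);

        // Everything else falls back to the argument's own formatting.
        if (arg is IFormattable formattable)
            return formattable.ToString(format, CultureInfo.InvariantCulture);
        return arg?.ToString() ?? string.Empty;
    }

    private static string Suffix(int n)
    {
        int lastTwo = Math.Abs(n % 100);
        if (lastTwo >= 11 && lastTwo <= 13) return "th";
        switch (lastTwo % 10)
        {
            case 1: return "st";
            case 2: return "nd";
            case 3: return "rd";
            default: return "th";
        }
    }
}
```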
We have basic support for being able to exclude resource distributions, but it is mostly manual. We need some more automation for this to go smoothly.
buildTransitive/<assemblyName>.targets file that has a PackageReference to the custom NuGet package.
Copying the ICU4N.targets file worked pretty well. We just need to automate the process of packing a copy of this file as buildTransitive/<assemblyName>.targets in the appropriate scenarios (giving the user the ability to opt in or out of this) when a custom distribution is involved and to reference the custom package name.
Similar methods were part of the ICU4J implementation. The rules for how to implement these methods for C# are documented here.
We will need the ValueStringBuilder overloads to optimize utility methods and avoid heap allocations in the main business logic. This requires updating the CodeGenerationSettings.xml file to include them and editing the T4 templates to correctly generate the overloads.
Note that ValueStringBuilder is a ref struct, so when passed into methods as a parameter, it must be passed as a ref argument. It is also declared internal, so all methods that are generated with it must also be internal.
public override IAppendable Normalize(string src, IAppendable dest)
{
    try
    {
        return dest.Append(src);
    }
    catch (IOException e)
    {
        throw new ICUUncheckedIOException(e); // Avoid declaring "throws IOException".
    }
}
For the above method, an overload can be generated as:
internal override void Normalize(string src, ref ValueStringBuilder dest)
{
    try
    {
        dest.Append(src); // The method returns void, so the result is not returned.
    }
    catch (IOException e)
    {
        throw new ICUUncheckedIOException(e); // Avoid declaring "throws IOException".
    }
}
It would be best to wait until #40 is done before working on this task.
A JDK-style ByteBuffer
class was ported from Apache Harmony to make porting the code from Java quicker. However, this class ended up in several public APIs.
Eventually, we should factor out ByteBuffer
and create a set of extension methods on byte[]
and other numeric types to make the same low-level conversions that ByteBuffer
does. ByteBuffer
should not be exposed on the public API; we should instead use byte[]
when passing the data around.
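As a rough sketch of what such extension methods might look like, the following reads and writes a big-endian 32-bit integer directly on a byte[] (big-endian being the default byte order of a Java ByteBuffer). The class and method names are hypothetical and not part of ICU4N, and the sketch assumes System.Buffers.Binary.BinaryPrimitives is available, which on older target frameworks would require the System.Memory package.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical helper for illustration only; not an existing ICU4N type.
public static class ByteArrayExtensions
{
    // Reads a big-endian Int32 from bytes[offset..offset + 4).
    public static int ReadInt32BigEndian(this byte[] bytes, int offset)
    {
        if (bytes is null) throw new ArgumentNullException(nameof(bytes));
        // AsSpan throws ArgumentOutOfRangeException if the range is invalid.
        return BinaryPrimitives.ReadInt32BigEndian(bytes.AsSpan(offset, sizeof(int)));
    }

    // Writes value as a big-endian Int32 into bytes[offset..offset + 4).
    public static void WriteInt32BigEndian(this byte[] bytes, int offset, int value)
    {
        if (bytes is null) throw new ArgumentNullException(nameof(bytes));
        BinaryPrimitives.WriteInt32BigEndian(bytes.AsSpan(offset, sizeof(int)), value);
    }
}
```

Unlike ByteBuffer, these helpers keep no position or limit state, so the caller has to track the offset explicitly. Whether that trade-off is acceptable would need to be checked against the existing call sites.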
Currently all building and testing is done on Azure DevOps. However, building and testing on the command line is undocumented.
While it is possible to build using dotnet build, dotnet pack, dotnet test and other commands, it would be simpler for potential contributors if there were a wrapper script to launch these commands.
The technology used for the build script is not that important as long as it runs cross-OS, but a way to get this done without adding any additional technology would simply be to make the script in MSBuild.
The build-pack-and-publish-libraries.yml file can be used as a template for the tasks in the build script.
However it is done, the README.md page should be updated with documentation on how to build/test from the command line.
This issue is blocking ICU4N from going to beta.
In Java, the ULocale
class was created as a separate entity in ICU4J because:
The Locale class is sealed/final.
Locale is stored in a static field, not associated with the current thread like CultureInfo is in .NET.
To make ICU4N compatible with .NET, a subclass of CultureInfo
should be created (named UCultureInfo
, for consistency) to replace ULocale
.
This subclass should have all of the functionality of ULocale
, but subclassing CultureInfo
makes the class automatically compatible with the properties that allow it to be set to the current thread.
There are several gaps in functionality between Java and .NET that need to be addressed, including:
nn-NO-NY to/from nn-NO
There may be others; more analysis is needed to get the complete list. Perhaps a conversation with the ICU team and/or Microsoft is required to work out how to account for these gaps.
The constructor of UCultureInfo
needs to accept both standard .NET cultures (in which case, we will just wrap the default culture), and it also should accept ICU-style parameters. A decision needs to be made whether to keep the ICU4N underscores (en_US@collation=phonebook
), or to make it more like the .NET style (en-US@collation=phonebook
). In the JDK, there are no underscores; the values are passed as separate parameters to the constructor of Locale, so there is no definite precedent to follow.
We may need to take the ICU data into consideration when making this decision.
.NET actually has more calendars than ICU, so it is unclear exactly how to deal with this just yet. ICU is meant to be an "up to date" version of Unicode that is more recent than .NET, but we need to check whether the calendars in .NET are being kept up with the standard.
Unlike Java, we won't need to convert UCultureInfo
> CultureInfo
as we are a subclass so UCultureInfo
is a CultureInfo
already.
To convert the other way, my thought is that we should use the same pattern as was done on the System.Type
class in .NET.
var name = this.GetType().GetTypeInfo().Name;
So, we could do a similar thing with CultureInfo
by creating an extension method:
var ucultureInfo = CultureInfo.CurrentCulture.GetUCultureInfo();
This would allow us to add additional properties to the subclass for use by ICU4N and/or its users.
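A minimal sketch of that extension method is shown below. UCultureInfo does not exist yet, so the class here is only a stand-in with a single constructor, and wrapping by culture name is an assumption. The real design would also have to handle ICU-style names and keywords, which the CultureInfo constructor would not accept.

```csharp
using System;
using System.Globalization;

// Hypothetical stand-in for the proposed subclass; the real class would carry
// the ULocale functionality.
public class UCultureInfo : CultureInfo
{
    public UCultureInfo(string name) : base(name) { }
}

public static class CultureInfoExtensions
{
    // Mirrors the Type.GetTypeInfo() pattern: returns the richer view of a culture.
    public static UCultureInfo GetUCultureInfo(this CultureInfo culture)
    {
        if (culture is null) throw new ArgumentNullException(nameof(culture));

        // Already the subclass, so no conversion is needed.
        if (culture is UCultureInfo ucultureInfo) return ucultureInfo;

        // Assumption: wrap a plain .NET culture by name.
        return new UCultureInfo(culture.Name);
    }
}
```

Because UCultureInfo is a CultureInfo, the result could be assigned directly to CultureInfo.CurrentCulture.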
There are probably additional things that need to be considered. More analysis is required.
Much of the public API is marked [Obsolete]
. It is unclear why these features are not removed upon some major release, but kept in and maintained indefinitely by the ICU team. We should probably contact them to get some clarity on why this was done, and what their decision would be on accessibility of those classes/members if they were building ICU today.
Most likely, the right choice for us is to make these members internal.
While BreakIterator
provides great low-level functionality for iterating forward and backward through breaks, it would be great if there were a simple way to do forward-only operations on string
, StringBuilder
, and char[]
.
IEnumerable<int> wordBreaks = theString.ToWordBreaks();
foreach (var wordBreak in wordBreaks) // "break" is a reserved word in C#
{
    // consume
}
Or
IEnumerable<int> sentenceBreaks = theString.ToSentenceBreaks(new CultureInfo("th"));
foreach (var sentenceBreak in sentenceBreaks)
{
    // consume
}
We would ideally create a different extension method (with overloads for optional culture) for all 4 modes:
We could then expand on this to do a higher level operation, such as providing an IEnumerable<string>
that would tokenize the text so it can be iterated with a foreach
loop.
foreach (var word in theText.ToWords(new CultureInfo("th-th")))
{
// consume each word
}
Some thought needs to be given to thread safety, since BreakIterator
requires a separate clone for each thread.
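One possible shape for these helpers is sketched below using C# iterator blocks. To keep the example self-contained, IBoundaryIterator is a hypothetical stand-in for BreakIterator (its member names are assumptions, not the ICU4N API), and WhitespaceBoundaryIterator is a toy implementation that only breaks at whitespace transitions. The iterator is passed in explicitly so that each thread can supply its own clone.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for BreakIterator; member names are assumptions.
public interface IBoundaryIterator
{
    void SetText(string text);
    int First();   // first boundary (always 0)
    int Next();    // next boundary, or BreakEnumeration.Done when exhausted
}

public static class BreakEnumeration
{
    public const int Done = -1;

    // Forward-only enumeration of boundary offsets.
    public static IEnumerable<int> ToBreaks(this string text, IBoundaryIterator iterator)
    {
        iterator.SetText(text);
        for (int b = iterator.First(); b != Done; b = iterator.Next())
            yield return b;
    }

    // Higher-level operation: the text between each pair of adjacent boundaries.
    public static IEnumerable<string> ToSegments(this string text, IBoundaryIterator iterator)
    {
        int start = -1;
        foreach (int end in text.ToBreaks(iterator))
        {
            if (start >= 0 && end > start)
                yield return text.Substring(start, end - start);
            start = end;
        }
    }
}

// Toy iterator for demonstration: a boundary wherever text switches between
// whitespace and non-whitespace. Real rules would come from ICU data.
public sealed class WhitespaceBoundaryIterator : IBoundaryIterator
{
    private string text = string.Empty;
    private int position;

    public void SetText(string text)
    {
        this.text = text ?? throw new ArgumentNullException(nameof(text));
        position = 0;
    }

    public int First() { position = 0; return 0; }

    public int Next()
    {
        if (position >= text.Length) return BreakEnumeration.Done;
        bool space = char.IsWhiteSpace(text[position]);
        do { position++; }
        while (position < text.Length && char.IsWhiteSpace(text[position]) == space);
        return position;
    }
}
```

Note that the toy iterator returns whitespace runs as segments too. A word-level helper like the ToWords example above would presumably filter those out using the break rule status.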
As far as I can tell, ICU4N doesn't include the spellout numbering capabilities of ICU4J.
I'm interested in assessing whether it's feasible to port this code and contribute it to the project. Having no familiarity with ICU-J internals, I wouldn't know where to start, but if you can provide any initial thoughts (perhaps you've looked at it and decided it's too hard...) then I'd appreciate any pointers.
Alternatively, rather than doing it ourselves we could sponsor the development.
Note, we are currently using ICU4N in the SaxonCS project for localised collation support.
#38 includes an ICU4N.targets
file that contains configuration settings that allow for custom distributions of resource data. These settings will need documentation once all of the configuration settings have been made user-configurable.
A core feature of the ICU project is to allow the end user to override the data in many ways (3 if I recall correctly). We need to research the best way to allow users to inject the data.
Unfortunately, a lot of time has passed since this issue was first noted until it was documented, so I don't recall all of the specific details. But the documentation is here.
This issue is blocking us from going to beta.
ICU has pretty good documentation already. However, many of the APIs were converted to be more .NET-like so we should probably at least have API docs.
We need to discuss with the ICU team whether it is within the realm of possibility for us to contribute ICU4N back to them to maintain. That would definitely factor into how this is done.
With the possible exception of BreakIterator
and its subclasses, all public iterators should be converted to enumerators or (if not critical end user functionality) marked internal so they can be dealt with later.
Most of this work has already been completed, with the exception of iterators in the ICU4N.Collation assembly.
The following ISet<T>
members are not yet implemented in UnicodeSet
.
IsProperSubsetOf
IsProperSupersetOf
IsSubsetOf
SymmetricExceptWith
AFAIK, this is the only work left to do on UnicodeSet
features, but we should review to ensure that is the case.
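For reference, the semantics these four members are expected to have (matching HashSet<T>) can be sketched generically as below. This is not UnicodeSet code: the SetRelations helper is hypothetical, it assumes the first argument already has set semantics (no duplicates), and a real UnicodeSet implementation would more likely work on its code point ranges than on individual elements.

```csharp
using System.Collections.Generic;

// Hypothetical reference semantics; "set" is assumed to contain no duplicates.
public static class SetRelations
{
    // Every element of set is also in other.
    public static bool IsSubsetOf<T>(ICollection<T> set, IEnumerable<T> other)
    {
        var otherSet = new HashSet<T>(other);
        foreach (T item in set)
            if (!otherSet.Contains(item)) return false;
        return true;
    }

    // Subset, and other has at least one element that set lacks.
    public static bool IsProperSubsetOf<T>(ICollection<T> set, IEnumerable<T> other)
    {
        var otherSet = new HashSet<T>(other);
        return set.Count < otherSet.Count && IsSubsetOf(set, otherSet);
    }

    // Every element of other is in set, and set has at least one extra element.
    public static bool IsProperSupersetOf<T>(ICollection<T> set, IEnumerable<T> other)
    {
        var otherSet = new HashSet<T>(other);
        if (otherSet.Count >= set.Count) return false;
        foreach (T item in otherSet)
            if (!set.Contains(item)) return false;
        return true;
    }

    // Keeps the elements that are in exactly one of the two collections.
    public static void SymmetricExceptWith<T>(ICollection<T> set, IEnumerable<T> other)
    {
        // Copy first so duplicates in other (or other == set) are handled.
        foreach (T item in new HashSet<T>(other))
        {
            if (!set.Remove(item)) set.Add(item);
        }
    }
}
```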
See apache/lucenenet#417 for a more complete description.
This issue also applies to the GetOrCreate()
method of SoftCache
, which should be reviewed.
It is suspected that this may be the source of concurrency issues with ThaiTokenizer
and ICUTokenizer
tests of Lucene.NET, which are known to fail without extra locking that was not part of the original design.
For example:
public const string SOME_VALUE = "theValue";
Should be converted to Pascal Case:
public const string SomeValue = "theValue";
I'm using SaxonCS 12.4 which references ICU4N and ICU4N.Resources. The ICU4N.Resources NuGet package is targeting .NET Standard 1.6 which brings in a lot of out of date NuGet packages that contain security vulnerabilities and need updating. Any plans to update the NuGet package to add targets for .NET 6/8 or update the existing target to .NET Standard 2.0/2.1 so we can get rid of these old package references?