This is a spin-off of …

Support proper encoding of strings with multi-valued encodings (fo-dicom, open)

mrbean-bremen commented on July 3, 2024

Support proper encoding of strings with multi-valued encodings

from fo-dicom.

Comments (11)

chen-wu commented on July 3, 2024

I also investigated this a little more. The problem is the difference between the .NET code pages and the character sets defined by the DICOM standard. I tried to use ICU to convert the UTF-16 string to the target character set, but the .NET version of the ICU library does not implement that conversion.

I will check whether other encodings have a similar problem. If not, maybe we can add a patch for ISO 2022 IR 13 only, or mark it as an exception that we cannot handle.

mrbean-bremen commented on July 3, 2024

I'll have a look at this, though probably not for the current release.

chen-wu commented on July 3, 2024

It seems that whether we need to switch to another encoder depends on whether the current encoder throws an exception.

However, the encodings in .NET may differ from the DICOM standard: some code pages cover more characters.
For example, see https://dicom.nema.org/medical/dicom/current/output/html/part05.html#sect_H.3.2.
"ISO 2022 IR 13" is JIS X 0201, and "ISO 2022 IR 87" is JIS X 0208.

But Shift JIS contains the characters of both JIS X 0201 and JIS X 0208. So when we use the IR 13 encoder (Shift JIS) to encode the whole string, it does not throw any exception.

These bytes can also be decoded correctly in the same way, as they will eventually be processed as Shift JIS.

So I think it is too difficult to implement such a feature. Maybe we should just use UTF-8 if necessary. Otherwise, we will need to identify the differences between the .NET encodings and the DICOM ones.
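
A minimal sketch of why an exception-based fallback does not help here, assuming the code page encodings are available on the runtime. `CanEncode` is a hypothetical helper written for this illustration, not part of fo-dicom or .NET.

```csharp
using System;
using System.Text;

// Code page encodings such as Shift JIS must be registered on .NET Core / .NET 5+.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Hypothetical helper: returns true if the named encoding accepts the whole
// text without triggering its exception fallback.
static bool CanEncode(string encodingName, string text)
{
    var strict = Encoding.GetEncoding(
        encodingName,
        EncoderFallback.ExceptionFallback,
        DecoderFallback.ExceptionFallback);
    try
    {
        strict.GetBytes(text);
        return true;
    }
    catch (EncoderFallbackException)
    {
        return false;
    }
}

Console.WriteLine(CanEncode("shift_jis", "ﾔﾏﾀﾞ"));  // half-width katakana (JIS X 0201)
Console.WriteLine(CanEncode("shift_jis", "山田"));  // kanji (JIS X 0208)
Console.WriteLine(CanEncode("iso-8859-1", "山田")); // Latin-1 has no kanji
```

Both Shift JIS calls are expected to succeed, so a missing exception alone does not show that the text fits into JIS X 0201. Only the Latin-1 call should report a failure.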

mrbean-bremen commented on July 3, 2024

@chen-wu - yes, handling JIS X 0201 vs JIS X 0208 can indeed be a problem, and we probably need to add specific code for that. I was actually aware that there might be some problems with that, but decided to ignore this in the PR for the time being, as it may get a little complicated, and I first wanted to cover the main cases (to my knowledge, this is the only case that may cause such problems).
Maybe we could handle this in a separate issue after this one is finished, to avoid too much complexity in a single PR. Anyway, all of this will get integrated only after the next release (which hopefully will be soon).

mrbean-bremen commented on July 3, 2024

@chen-wu - do you have an example or test that shows the problem? E.g. a string that can be encoded with ISO 2022 IR 13 but is not valid JIS X 0201 (or the same with ISO 2022 IR 87 / JIS X 0208)?

chen-wu commented on July 3, 2024

@mrbean-bremen This example should work.

You can use the following code:

        using System.Text;

        // Code page encodings (Shift JIS, ISO-2022-JP) must be registered on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Bytes of `value` with escape sequences that switch between the
        // single-byte set and JIS X 0208 (ESC $ B ... ESC ( J).
        var raw = new byte[] {
            0xD4, 0xCF, 0xC0, 0xDE, 0x5E, 0xC0, 0xDB, 0xB3, 0x3D, 0x1B, 0x24, 0x42, 0x3B, 0x33, 0x45, 0x44, 0x1B, 0x28,
            0x4A, 0x5E, 0x1B, 0x24, 0x42, 0x42, 0x40, 0x4F, 0x3A, 0x1B, 0x28, 0x4A, 0x3D, 0x1B, 0x24, 0x42, 0x24, 0x64,
            0x24, 0x5E, 0x24, 0x40, 0x1B, 0x28, 0x4A, 0x5E, 0x1B, 0x24, 0x42, 0x24, 0x3F, 0x24, 0x6D, 0x24, 0x26, 0x1B,
            0x28, 0x4A,
        };
        var value = "ﾔﾏﾀﾞ^ﾀﾛｳ=山田^太郎=やまだ^たろう"; // half-width katakana = kanji = hiragana
        var value2 = "山田";                          // kanji only

        var shiftJisEncoding = Encoding.GetEncoding("shift_jis");
        var iso2022JpEncoding = Encoding.GetEncoding("iso-2022-jp");

        // Encode both strings with both encodings and compare the results with `raw`.
        var result1 = shiftJisEncoding.GetBytes(value);
        var result2 = shiftJisEncoding.GetBytes(value2);
        var result3 = iso2022JpEncoding.GetBytes(value);
        var result4 = iso2022JpEncoding.GetBytes(value2);

You can see that "山田" does not trigger an exception with the Shift JIS encoding. The four results also show that "山田" is encoded into different bytes by the two encodings.
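
To make the difference visible, the encoded bytes can be printed as hex. This is only a sketch: `Hex` is a hypothetical helper added for the illustration, and the exact escape sequences emitted by the .NET "iso-2022-jp" encoder are not asserted here.

```csharp
using System;
using System.Text;

Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Hypothetical helper: formats a byte array as space-separated hex.
static string Hex(byte[] bytes) => BitConverter.ToString(bytes).Replace("-", " ");

var kanji = "山田";

// Shift JIS uses its own two-byte values for the kanji.
Console.WriteLine(Hex(Encoding.GetEncoding("shift_jis").GetBytes(kanji)));

// ISO-2022-JP switches to JIS X 0208 with ESC $ B (1B 24 42) and then
// emits the raw JIS X 0208 codes.
Console.WriteLine(Hex(Encoding.GetEncoding("iso-2022-jp").GetBytes(kanji)));
```

The Shift JIS output should read `8E 52 93 63`, while the ISO-2022-JP output should contain `3B 33 45 44` after the escape sequence, which are the same kanji bytes that appear in the `raw` array after `0x1B, 0x24, 0x42`. Note that `raw` returns to the single-byte set with `0x1B, 0x28, 0x4A` (ESC ( J).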

mrbean-bremen commented on July 3, 2024

Thank you, I will have a look at this!

chen-wu commented on July 3, 2024

I think we can just give up here; we have too many problems to solve. 😂

The DICOM standard defines some character sets, but these character sets are only a "standard". Different vendors have different implementations: they may add characters at code points that the standard leaves unused, or combine several standards into one code page.

The standards also have different versions. For example, the DICOM standard uses "JIS X 0208-1990", whereas the "iso-2022-jp" standard contains both "JIS X 0208-1978" and "JIS X 0208-1983", and "iso-2022-jp-2" contains "JIS X 0208-1990". I did not check the implementation; it may be more complex.

In our case, "shift-jis" contains both the "JIS X 0201-1997" and the "JIS X 0208-1997" characters. However, the bytes of its 0208 part are not compatible with the 0208 standard: you need to apply a shift conversion to get from one to the other. EUC-JP may be correct.
https://en.wikipedia.org/wiki/Shift_JIS
That is why we cannot encode Japanese text correctly.
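
The shift conversion mentioned above can be sketched as follows. `ShiftJisToJisX0208` is a hypothetical helper written for this illustration (it is not part of fo-dicom or .NET), and it only covers the lead-byte ranges 0x81-0x9F and 0xE0-0xEF that map to JIS X 0208 rows.

```csharp
using System;
using System.Text;

Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Hypothetical helper: converts one Shift JIS double-byte character to the
// two-byte JIS X 0208 code that is used after the ESC $ B escape sequence.
static (byte Hi, byte Lo) ShiftJisToJisX0208(byte lead, byte trail)
{
    bool validLead = (lead >= 0x81 && lead <= 0x9F) || (lead >= 0xE0 && lead <= 0xEF);
    bool validTrail = trail >= 0x40 && trail <= 0xFC && trail != 0x7F;
    if (!validLead || !validTrail)
        throw new ArgumentException("Not a Shift JIS double-byte character in the JIS X 0208 range.");

    // Each Shift JIS lead byte covers two JIS X 0208 rows.
    int hi = lead - (lead <= 0x9F ? 0x71 : 0xB1);
    hi = hi * 2 + 1;

    // Trail bytes skip 0x7F; the upper half selects the second row.
    int lo = trail > 0x7F ? trail - 1 : trail;
    if (lo >= 0x9E) { lo -= 0x7D; hi++; }
    else { lo -= 0x1F; }

    return ((byte)hi, (byte)lo);
}

// Convert the Shift JIS bytes of a kanji-only string pair by pair.
byte[] sjis = Encoding.GetEncoding("shift_jis").GetBytes("山田");
for (int i = 0; i + 1 < sjis.Length; i += 2)
{
    var (j1, j2) = ShiftJisToJisX0208(sjis[i], sjis[i + 1]);
    Console.Write($"{j1:X2} {j2:X2} ");
}
Console.WriteLine();
```

For "山田" this should yield `3B 33 45 44`, which matches the kanji bytes in the `raw` array of the earlier example. A real encoder would still have to emit the escape sequences and deal with vendor extensions in the code page.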

So, to make our implementation fully DICOM compliant, we may need to import the correct versions of the character set standards from ISO or somewhere else and write the encoders ourselves, to eliminate the differences between the character set standards and the .NET code pages.

If we still want to use the .NET encodings, it would be better to also check the range of the encoded bytes to see whether we need to move on to the next encoder.
It may need some additional work to confirm that, but I think this will only happen in Japanese. Chinese and Korean text usually does not need multiple character sets to be encoded.
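
A rough sketch of such a byte-range check, assuming the Shift JIS code page is used as the stand-in for ISO 2022 IR 13. `FitsJisX0201` is a hypothetical helper for this illustration. It treats 0x81-0x9F and 0xE0-0xFC as double-byte lead bytes, so any text that produces one of them does not fit into the single-byte JIS X 0201 set.

```csharp
using System;
using System.Text;

Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Hypothetical helper: returns true only if the text encodes to single-byte
// Shift JIS characters, i.e. without any double-byte lead byte.
static bool FitsJisX0201(string text)
{
    var strict = Encoding.GetEncoding(
        "shift_jis",
        EncoderFallback.ExceptionFallback,
        DecoderFallback.ExceptionFallback);
    byte[] bytes;
    try
    {
        bytes = strict.GetBytes(text);
    }
    catch (EncoderFallbackException)
    {
        return false; // not representable in Shift JIS at all
    }

    foreach (byte b in bytes)
    {
        // The scan stops at the first lead byte, so trail bytes are never inspected.
        bool isLeadByte = (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
        if (isLeadByte)
            return false; // double-byte character: move on to the next encoder
    }
    return true;
}

Console.WriteLine(FitsJisX0201("ﾔﾏﾀﾞ^ﾀﾛｳ")); // half-width katakana only
Console.WriteLine(FitsJisX0201("山田^太郎"));  // contains kanji
```

The first call is expected to return true and the second false. This does not address the differences between the code page and JIS X 0201 for individual code points such as 0x5C and 0x7E.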

Note: if you want to check PS3.3 of the DICOM Standard, do NOT use the HTML version.

mrbean-bremen commented on July 3, 2024

"It may need some additional work to confirm about that, but I think this will only happen in Japanese"

That's what I thought. I think we have to go that extra step to be compliant, but it should probably be done in an extra PR after #1791 has been merged (I'm holding that one as a draft until the next release is out).
