Giter Site home page Giter Site logo

Comments (21)

ra-dave avatar ra-dave commented on June 13, 2024

Just to be clear, this issue is happening on Windows. I haven't been able to test if it happens on Linux yet but I suspect it probably does not happen on Linux.

from bit7z.

rikyoz avatar rikyoz commented on June 13, 2024

Hi!

Folders that have non-ASCII characters are not being extracted to the same name, the special characters get changed.

For example, a folder name of SottoMenù gets extracted to SottoMenù

According to this page, this indicates that the problem is being caused by UTF-8 bytes being interpreted as Windows-1252 (or ISO 8859-1) bytes.

https://www.i18nqa.com/debug/utf8-debug.html

By default, bit7z follows the UTF-8 Everywhere Manifesto since the v4, and hence it interprets std::strings as UTF-8 encoded.
On the contrary, 7-zip uses UTF-16 wide strings in its interface, so conversions between the two encodings happen when bit7z communicates with 7-zip.

However, if the current codepage of the program is not actually UTF-8, some problems like yours might arise where a UTF-8 std::string is interpreted using the current codepage encoding (e.g., Windows-1252) in other parts of the code. In particular, with recent versions of MSVC, bit7z uses C++17's std::filesystem::path, which might use the current codepage.

One possible fix, which I strongly recommend, is to enforce using the UTF-8 codepage at the application level, as explained here.
In short, you include a manifest file like this in your application.
I recommend this because it solves any string encoding issues, especially for applications that handle strings in languages having non-ASCII characters.
Unfortunately, this option is available only from Windows 10 1903.

Anyway, I am also considering making it optional to use the current codepage for string conversions. This would also fix your issue, provided that your program's current codepage supports the characters in the name of the files you're extracting; in the specific case you described, the Windows-1252 set does include the ù character, but this might not be true in general.

from bit7z.

rikyoz avatar rikyoz commented on June 13, 2024

Just to be clear, this issue is happening on Windows. I haven't been able to test if it happens on Linux yet but I suspect it probably does not happen on Linux.

Yeah, most Linux distros use UTF-8 by default, and hence should not have this issue.

from bit7z.

ra-dave avatar ra-dave commented on June 13, 2024

I set up a debugging environment today and will be trying some things tomorrow.

I don't think using a manifest is going to be a viable workaround for us (at least for right now) but I will investigate it. I will also look into std::filesystem::path. BTW, we were previously using minizip-ng and did not have this issue, so it may indeed be related to std::filesystem::path or something like that (which maybe minizip-ng does not use).

I will keep you updated, and if you have any other ideas or suggestions please let me know. This is a high priority for us.

from bit7z.

ra-dave avatar ra-dave commented on June 13, 2024

Using BIT7Z_USE_NATIVE_STRING seems to fix the issue. What do I need to watch out for if I use that? I just need to make sure all strings passed to bit7z are L""?

from bit7z.

rikyoz avatar rikyoz commented on June 13, 2024

Using BIT7Z_USE_NATIVE_STRING seems to fix the issue.

Yeah, sorry, I forgot to mention this option!
It basically makes bit7z use wide strings in its API, avoiding any encoding conversion.

What do I need to watch out for if I use that? I just need to make sure all strings passed to bit7z are L""?

The only cons of using BIT7Z_USE_NATIVE_STRING occur when the program has to be cross-platform and thus must also be built for Linux and macOS.
On these platforms, the string type is always std::string (since it's the native type).

In this case, bit7z provides the macro BIT7Z_STRING, which transforms a string literal to a wide string literal if needed (i.e., when BIT7Z_USE_NATIVE_STRING is enabled on Windows); the transformation is done at compile time by simply prepending the L before the string literal.

Example:

extractor.extract(BIT7Z_STRING("path/to/archive.7z"), BIT7Z_STRING("path/to/output/folder/"));

Apart from this, there should not be any other caution to be taken.

from bit7z.

ra-dave avatar ra-dave commented on June 13, 2024

I understand to use a literal string when using BIT7Z_USE_NATIVE_STRING I need to use BIT7Z_STRING but what do I do when I need to use a std::string as an argument instead of a literal string? For example:

writer.setPassword(password);

where password is a std::string

And whatever solution is used needs to be cross platform, without using #ifdefs preferably.

from bit7z.

ra-dpa avatar ra-dpa commented on June 13, 2024

I'm debugging the code. In processedItem.cpp:

   case BitPropVariantType::String:
        mFilePath = fs::path( prop.getString() );
        break;

The character is correct in prop but the call to prop.getString() changes the character.

image

I'm not understanding why prop.getString() would misconvert the string. prop is a wide string so I don't get why the conversion doesn't work. It's converting UTF-16 to UTF-8 so it should work regardless of code page.

from bit7z.

rikyoz avatar rikyoz commented on June 13, 2024

I understand to use a literal string when using BIT7Z_USE_NATIVE_STRING I need to use BIT7Z_STRING but what do I do when I need to use a std::string as an argument instead of a literal string? For example:

writer.setPassword(password);

where password is a std::string

And whatever solution is used needs to be cross platform, without using #ifdefs preferably.

The only solution, at the moment, is to convert the std::string to a std::wstring to be passed to bit7z.

I'm not understanding why prop.getString() would misconvert the string. prop is a wide string so I don't get why the conversion doesn't work. It's converting UTF-16 to UTF-8 so it should work regardless of code page.

I guess you're not using BIT7Z_USE_NATIVE_STRING here.
If I'm not mistaken, the problem is not in the conversion performed inside prop.getString().

The problem is that the std::filesystem::path of MSVC's standard library internally uses the Windows native string type, so it needs to convert the std::string returned by prop.getString() back to a UTF-16 wide string.
I checked the source code of the MSVC's STL, and it seems to consider the std::string as encoded using the current codepage, while it is actually UTF-8 (due to the conversion performed by bit7z), hence the issue.

image

image

image

The only real fix will be to allow bit7z converting to std::string using the current code page. I'll add this option to the upcoming stable version of bit7z.

from bit7z.

ra-dpa avatar ra-dpa commented on June 13, 2024

Thanks for the insights into this problem. I am kind of shocked at how this issue exists in std::filesystem. It must be the source of many problems.

Is the reason you aren't using the UTF-8 filesystem pieces (for example, fs:u8path() instead of fs::path()) is to support older C++ versions? I literally changed this:

mFilePath = fs::path( prop.getString() );

to this:

mFilePath = fs::u8path( prop.getString() );

And my problem went away.

from bit7z.

ra-dave avatar ra-dave commented on June 13, 2024

I was not aware of this until this week when I started looking into this issue, but in our code we have a wrapper around std::filesystem::path so that UTF-8 works correctly on Windows. You could consider doing something like that if you don't want to use ::u8path.

from bit7z.

rikyoz avatar rikyoz commented on June 13, 2024

Thanks for the insights into this problem.

You're welcome!

I am kind of shocked at how this issue exists in std::filesystem. It must be the source of many problems.

Yeah, but to be fair, it's more of a problem of Windows not using UTF-8 for std::string as practically any other operating system.

Is the reason you aren't using the UTF-8 filesystem pieces (for example, fs:u8path() instead of fs::path()) is to support older C++ versions?

Actually, there's no specific reason.

When I started working on the cross-platform support in bit7z, I decided to stick with the UTF-8 Everywhere Manifesto, which mandates using UTF-8 encoded std::strings everywhere (and hence no need for fs::u8path).
This made the development more straightforward and the code simpler, at least initially.

Also, albeit fs::u8path was introduced only in C++17, it is already deprecated in C++20 due to the introduction of std::u8string.
As a side note, if the compiler is too old and doesn't support C++17's std::filesystem, bit7z uses a drop-in replacement library.

I'm actually thinking of implementing another possible fix: replacing every occurrence of CP_UTF8 with GetACP() in bit7z. In this way, every conversion between wide and narrow strings will use the current codepage rather than forcing the library user into using UTF-8.

In any case, I still recommend using UTF-8 in any program using bit7z. If the program encounters an archive with some Unicode character not supported by the system codepage, conversion errors might occur, and bit7z might not be able to do much about it.
Of course, if you are sure that your program doesn't work on archives with characters outside the system codepage, using UTF-8 is not needed.

from bit7z.

ra-dave avatar ra-dave commented on June 13, 2024

In any case, I still recommend using UTF-8 in any program using bit7z. If the program encounters an archive with some Unicode character not supported by the system codepage, conversion errors might occur, and bit7z might not be able to do much about it.
Of course, if you are sure that your program doesn't work on archives with characters outside the system codepage, using UTF-8 is not needed.

So just to be clear, there are three options for fixing this in my situation:

  • Enable UTF-8 global support in Windows. I tried it and it does work but we can't force our customers to use this.
  • Use a manifest to set the UTF-8 code page. We also can't force this on customers, and besides, they may be running an older version of Windows that doesn't support it.
  • Use the BIT7Z_ USE_ NATIVE_ STRING option and convert all std::string to std::wstring in our code for bit7z calls (but for Windows only).

Did I miss any other options?

from bit7z.

rikyoz avatar rikyoz commented on June 13, 2024
  • Enable UTF-8 global support in Windows. I tried it and it does work but we can't force our customers to use this.

Yeah, I wouldn't choose this option either.

  • Use a manifest to set the UTF-8 code page. We also can't force this on customers, and besides, they may be running an older version of Windows that doesn't support it.

I see. Yeah, this is the best option, but I understand why you cannot use it.

  • Use the BIT7Z_USE_NATIVE_STRING option and convert all std::string to std::wstring in our code for bit7z calls (but for Windows only).

This is not ideal, I understand.

Did I miss any other options?

No, there are no other option, at least at the moment.

I can distinguish between two cases where string encoding conversion is needed:

  1. When a string comes from or is provided by the public API of bit7z.
  2. When a string is handled internally by bit7z.

In the first case, the std::string is usually interpreted by the program using the system codepage, and bit7z can't do much about it (at least to my knowledge).

However, in the second case, bit7z has more control on how it can interpret the string. This is the case you noticed before, where mFilePath = fs::u8path( prop.getString() ); fixes the issue.

So another option might be the following:

  • At the public API level, giving the possibility to the library's user to choose whether to use either the system codepage or UTF-8 for strings.
  • Internally, bit7z will always use UTF-8 (if it's not using native strings, of course), using for example fs::u8path as you suggested.

In this way, issues like yours should be fixed: items with non-ASCII names would be extracted correctly, since internally bit7z uses UTF-8.
There might still be some issues though. For example, a FileCallback of the extractor might report the extracted file name using UTF-8 or the system codepage, and in either way the library's user might have some issues in using/printing/decoding it.

But at the moment I can't come up with any better solution than this 😅.

from bit7z.

ra-dave avatar ra-dave commented on June 13, 2024

Can I use bit7z::to_string() to write cross-platform code while using BIT7Z_USE_NATIVE_STRINGS on Windows? I am trying that and the code appears correct but I get linker errors. For example, for setting the password I do this:

writer.setPassword(bit7z::to_tstring(password));

from bit7z.

rikyoz avatar rikyoz commented on June 13, 2024

Can I use bit7z::to_string() to write cross-platform code while using BIT7Z_USE_NATIVE_STRINGS on Windows? I am trying that and the code appears correct but I get linker errors. For example, for setting the password I do this:

writer.setPassword(bit7z::to_tstring(password));

bit7z::to_tstring() is a template function for converting numeric values to a bit7z::tstring. It is basically an alias to either std::to_string or std::to_wstring according to whether BIT7Z_USE_NATIVE_STRING is enabled or not.
So unfortunately no, it can't be used for converting between string types.

I could make the bit7z's string conversion functions available in the public API, but at that point it would be the same as using bit7z without enabling BIT7Z_USE_NATIVE_STRING.

from bit7z.

ra-dave avatar ra-dave commented on June 13, 2024

I'm just trying to figure out some way to write cross-platform code that works on both Linux and Windows and that properly handles UTF-8 on Windows. Since std::string must be converted to std::wstring (for Windows), I'm thinking this is not possible.

from bit7z.

rikyoz avatar rikyoz commented on June 13, 2024

I think the best way to write cross-platform code for both Linux and Windows is not to use BIT7Z_USE_NATIVE_STRING.

On the release/v4.0.0 branch, I'm working on some possible improvements on how bit7z handles string encodings.

In particular, I added the possibility for the user to enable using the system codepage for string conversions (option BIT7Z_USE_SYSTEM_CODEPAGE) rather than UTF-8.

I'm also working on avoiding any unnecessary string conversions in the internals of bit7z.
For example, I fixed that line of code in processedItem.cpp without having to use u8path: I realized that a BitPropVariant string is already a wide string, so bit7z now uses that for initializing the fs::path object, avoiding any string conversion.

The main idea is to avoid conversions in the internal code of bit7z. In this way, for example, items with names having characters outside the system codepage should be extracted correctly, which was your original issue.

There might still be some issues at the API level, though, where a string conversion might still be needed (like in the setPassword method you are trying to use).
In this case, I wonder if there's much that bit7z can do to solve the problem.
std::string is agnostic concerning character encodings: it simply stores bytes as char values. It is how the program interprets those bytes that cause the issue.

If the password string is not UTF-8 encoded, it's probably using the default system codepage; hence, the BIT7Z_USE_SYSTEM_CODEPAGE option might be helpful for your use case.

On the other hand, you might have little control over the content of an archive (it might contain a file with a name having characters outside the codepage set). So when using something like the FileCallback for printing the name of the file being extracted, you might have some unexpected output since the system codepage cannot represent those characters (e.g., a Japanese character cannot be expressed in Windows-1252).
One possible solution for this latter issue might be something like the Win32 API function SetConsoleOutputCP(CP_UTF8), but I didn't test it.

from bit7z.

rikyoz avatar rikyoz commented on June 13, 2024

Just a quick update.
I delved further into the problem, and my conclusion is that while it is possible to write/print UTF-8 strings when the Windows codepage is not UTF-8, reading a UTF-8 string input (e.g., from the console) is basically impossible.
I found this blog post that seems to confirm this: https://nullprogram.com/blog/2020/05/04/.

In the same post, the author suggests how to deal with the issue when, for example, you need to read a user's password from the console, which I think might be close to your use case.
In short, the best way to do this is to read a wide string (UTF-16) from the console using ReadConsoleW and then convert it to UTF-8.

So I just pushed a commit (0e1fc9a) to the v4 branch, which introduces another to_string function that can be used for converting from wide strings to tstring.
I hope it can help you in your use case!

from bit7z.

ra-dave avatar ra-dave commented on June 13, 2024

Just to be clear, my app already fully supports UTF-8 on both Linux and Windows. What am I hoping for is a simple way to make bit7z API calls on both platforms using one set of code that works on both platforms and which will properly handle UTF-8 on Windows.

We may have to resort to having two sets of code (one for Linux and one for Windows) for the parts that use bit7z (or have some kind of abstraction layer that takes care of the differences), It may even be as simple as doing our own #defines to handle the string versions, we would just like to get away from that and be able to write truly cross-platform code.

I'm also working on avoiding any unnecessary string conversions in the internals of bit7z.
For example, I fixed that line of code in processedItem.cpp without having to use u8path: I realized that a BitPropVariant string is already a wide string, so bit7z now uses that for initializing the fs::path object, avoiding any string conversion.

I just tried this and it seems to fix our issue. I will do more testing tomorrow. And I am not using BIT7Z_USE_NATIVE_STRING. I don't need to, right? Do you have any other places in the code that you will also be updating for this type of change?

Thanks so much for all the support!

from bit7z.

rikyoz avatar rikyoz commented on June 13, 2024

Just to be clear, my app already fully supports UTF-8 on both Linux and Windows. What am I hoping for is a simple way to make bit7z API calls on both platforms using one set of code that works on both platforms and which will properly handle UTF-8 on Windows.

We may have to resort to having two sets of code (one for Linux and one for Windows) for the parts that use bit7z (or have some kind of abstraction layer that takes care of the differences), It may even be as simple as doing our own #defines to handle the string versions, we would just like to get away from that and be able to write truly cross-platform code.

Sorry, I misunderstood! 😅
If your app fully supports UTF-8, you should not have any problem with the default settings of bit7z (i.e., without enabling BIT7Z_USE_NATIVE_STRING).
By default, all std::string are considered UTF-8 encoded by bit7z.

This was an intentional choice precisely to spare library users from having to deal with two different sets of code and simplify writing cross-platform code.

The BIT7Z_USE_NATIVE_STRING option is meant mainly for "non-cross-platform" projects that need to run only on Windows.

I just tried this and it seems to fix our issue. I will do more testing tomorrow.

Great, thanks!

And I am not using BIT7Z_USE_NATIVE_STRING. I don't need to, right?

No, you don't need it. The change only concerns the internals of bit7z, not its API.

Do you have any other places in the code that you will also be updating for this type of change?

I just pushed to the v4 branch some commits with changes of the same type (697707d and 9f32061). I also cleaned up a bit some internal code so that now many unnecessary string conversions are completely avoided.

Thanks so much for all the support!

You're welcome!
Thank you for using bit7z and for the feedback, I really appreciate it!

from bit7z.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.