
utf8.h's Issues

provide a function to get the previous codepoint

As far as I can tell, the library currently only provides a way to iterate over a byte sequence in one direction, using utf8codepoint. It would be super useful to have a function to go the other way too, and possibly rename them to utf8next and utf8prev or something similar. I'd be happy to add this if it's something you'd merge!

Thanks for the library, it's very clean, lightweight and useful!
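Stepping backwards only requires skipping continuation bytes (those matching 0b10xxxxxx), since a codepoint's lead byte is the first non-continuation byte before the current position. A self-contained sketch of a hypothetical utf8prev (not an existing library function):

```c
/* hypothetical utf8prev: given the start of the string and a pointer to
 * the current codepoint, step back to the start of the previous codepoint
 * by skipping continuation bytes (0b10xxxxxx). Sketch only. */
static const char *utf8prev(const char *begin, const char *s) {
  if (s == begin)
    return begin; /* nothing before the start */
  do {
    s--;
  } while (s > begin && (0xc0 & (unsigned char)*s) == 0x80);
  return s;
}
```

On the bytes `61 c3 a9` ("aé"), stepping back from the terminator lands on the `c3` lead byte, not the `a9` continuation byte.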

utf8upr/lwr size issues?

Hi, I was looking at the docs for utf8upr/lwr, and they don't seem to indicate what happens if the string passed to them doesn't have enough space for the new codepoints. I understand that letters may have different byte sizes in their upper/lowercase variants, so I was wondering whether utf8upr/lwr will allocate extra memory as required.

Looking at the code, though, it seems like they just call utf8catcodepoint, which AFAIK doesn't allocate additional memory. In fact, the size argument in that call is set to the size of the new codepoint, rather than the size of the buffer as it should be. Is this correct?

Support constexpr?

Hey, thanks for this library. I'd like to be able to use it in a constexpr context.
Would you be willing to add a utf8_constexpr macro that would be used in contexts that allow it?
I might be able to work on a PR, but I don't have a time estimate.
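A sketch of the idea (utf8_constexpr is the hypothetical macro name, not an existing part of utf8.h): expand to constexpr only where the compiler supports C++14 relaxed constexpr, and to nothing otherwise, so the header stays valid C:

```c
/* hypothetical utf8_constexpr: expands to constexpr only where the
 * compiler supports C++14 relaxed constexpr, otherwise to nothing */
#if defined(__cplusplus) && (__cplusplus >= 201402L)
#define utf8_constexpr constexpr
#else
#define utf8_constexpr
#endif

/* a function declared this way stays usable from C and pre-C++14 C++ */
utf8_constexpr int utf8_example_len_ascii(const char *s) {
  int n = 0;
  while (s[n] != '\0')
    n++;
  return n;
}
```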

Couple of thoughts

Hey, nice library! I am looking for utf-8 C string parsing and this fits the bill. I had a couple of thoughts after reading the code.

  • While you have dutifully re-implemented the string.h functions, some of them are considered harmful. For example, utf8ncpy may not append a null terminating character. If you don't consider it too sanctimonious, supplying non-harmful functions and making people opt in to the riskier ones (with an ifdef?) might be helpful. I suspect so, since safety is clearly a concern of yours, given that you return void*.

For instance, a safer utf8ncpy that guarantees a null terminator (possibly truncating the last character) and returns a boolean indicating whether the string was truncated could be helpful, even if it doesn't conform to anything in string.h. It should also fill in only one null character at the end, because zeroing everything after the terminator is a waste of cycles. I have been using such a workhorse function for years.

  • Consider using restrict where available. This will shrink code size and reduce waits on memory accesses. Again, a non-conforming change for some applications, since applying it requires parameters not to overlap, but that's going to be the case in real-world situations anyway.
  • I would be happy to trap strange input parameters by defining an assert macro before including your header. For instance (and sorry to pick on utf8cpy), if the maximum number of elements is zero, you could check that in an assert which is a no-op unless defined by the caller prior to header inclusion.

Here is a snippet that lets you portably apply the restrict keyword if you're interested:

#if defined(__GNUC__) || defined(__clang__)
    #ifdef __cplusplus
        #define utf8_restrict __restrict
    #else
        #if defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 199901L)
            #define utf8_restrict restrict
        #else
            #define utf8_restrict __restrict
        #endif
    #endif
#elif defined(_MSC_VER) && (_MSC_VER >= 1400) /* vs2005 */
    #define utf8_restrict __restrict
#else
    #define utf8_restrict
#endif

Not an issue

How do I loop through a utf8 string, along the lines of:

for (int i = 0; i < strlen(str); i++) do_something_with(str[i]);

Character iterating?

What do you suggest is the best way of iterating over codepoints using this library?
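With the library itself, the usual idiom is `for (s = utf8codepoint(str, &cp); cp; s = utf8codepoint(s, &cp))`. The sketch below uses a minimal stand-in decoder so it compiles without utf8.h; it assumes well-formed input and is not the library's implementation:

```c
#include <stddef.h>
#include <stdint.h>

/* minimal stand-in for utf8.h's utf8codepoint: decode the codepoint at s
 * into *out and return a pointer to the start of the next codepoint.
 * Sketch only: assumes well-formed input, no validation. */
static const char *next_codepoint(const char *s, int32_t *out) {
  unsigned char c = (unsigned char)*s;
  if (c < 0x80) { /* 1-byte (ASCII) */
    *out = c;
    return s + 1;
  }
  if (c < 0xe0) { /* 2-byte sequence */
    *out = ((c & 0x1f) << 6) | (s[1] & 0x3f);
    return s + 2;
  }
  if (c < 0xf0) { /* 3-byte sequence */
    *out = ((c & 0x0f) << 12) | ((s[1] & 0x3f) << 6) | (s[2] & 0x3f);
    return s + 3;
  }
  /* 4-byte sequence */
  *out = ((c & 0x07) << 18) | ((s[1] & 0x3f) << 12) |
         ((s[2] & 0x3f) << 6) | (s[3] & 0x3f);
  return s + 4;
}

/* walk a whole string codepoint by codepoint */
static size_t count_codepoints(const char *s) {
  int32_t cp;
  size_t n = 0;
  for (s = next_codepoint(s, &cp); cp != 0; s = next_codepoint(s, &cp))
    n++;
  return n;
}
```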

[feature] Add utf8ndup

This is my current implementation. I am using it to replace all of my strndup calls.

#include <utf8.h>

void*
utf8ndup(const void* src, size_t n)
{
    const char* s = (const char*)src;
    char* c       = 0;
    size_t bytes  = 0;

    // allocate n bytes plus one for the null terminator
    c = (char*)malloc(n + 1);

    if (0 == c) {
        // out of memory so we bail
        return 0;
    }

    // copy at most n bytes of src byte-by-byte into our new utf8 string
    while ('\0' != s[bytes] && bytes < n) {
        c[bytes] = s[bytes];
        bytes++;
    }

    // append null terminating byte
    c[bytes] = '\0';
    return c;
}

I don't know if this is desirable. I'm half tempted to just calloc and memcpy the results instead.

utf8ncpy writes n+1 bytes (buffer overflow)

Here is an example test case where I tell utf8ncpy to write at most 10 bytes, but it results in all 11 bytes of the buffer being written. I first noticed this in a larger program when it triggered a stack check exception due to buffer overflow.

#include <string.h>
#include <stdio.h>
#include "utf8.h"

int main(int argc, char* argv[]) {
  char buffer[11];
  memset(buffer, 0xdd, 11);
  printf("%02x\n", buffer[10] & 0xff);

  utf8ncpy(buffer, "foo", 10);

  printf("%02x\n", buffer[10] & 0xff);
}

which I have compiled simply with clang main.c, with clang version: Apple LLVM version 9.1.0 (clang-902.0.39.2)

I get the result of

dd
00

when I would expect

dd
dd

Bug in utf8valid

utf8valid will fail on a utf8 file with 2 or more line breaks in a row.

utf8ncat - size wraparound bug

Hello 👋! I think I found a small bug in utf8ncat when the function is executed with size_t n being 0.
The function will still write all remaining bytes to the dst buffer.

for example:

utf8_int8_t dst[12] = { 'h', 'e', 'l', 'l', 'o', '\0' };
const utf8_int8_t* src = "world";
utf8ncat(dst, src, 0);

// dst will be { 'h', 'e', 'l', 'l', 'o', 'w', 'o', 'r', 'l', 'd', '\0', '\0' };

If I am not mistaken, it is because size_t is unsigned, which causes the following --n to wrap around:

utf8.h/utf8.h

Line 631 in 89f6a43

} while (('\0' != *src) && (0 != --n));

I presume this is not the intended behavior and that this is a bug.
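One way to avoid the wraparound is to test the bound before consuming it; a self-contained sketch of the idea (this is not the library's actual fix):

```c
#include <stddef.h>
#include <string.h>

/* sketch of an n-bounded append where n == 0 means "append nothing";
 * the bound is tested before any byte is consumed, so an unsigned n
 * can never wrap around */
static void bounded_append(char *dst, const char *src, size_t n) {
  size_t i = strlen(dst);
  size_t j = 0;
  while (j < n && src[j] != '\0') { /* test before consuming the bound */
    dst[i++] = src[j++];
  }
  dst[i] = '\0';
}
```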

utf8lwr() - handle accented upper case vowels

Currently utf8.h's lowercase conversion only handles ASCII chars. It is not possible to handle accented uppercase and lowercase vowels like Á, for example. For my specific purpose I would like to cover at least ÀÈÌÒÙ, since they cover most Latin languages. For the time being I would also be OK with monkey-patching utf8.h myself. I suppose the change has to take place here:

utf8.h/utf8.h

Line 1016 in 1ca34ec

if (('A' <= cp) && ('Z' >= cp)) {

Any advice on how to do it? (I'm not a great C coder unfortunately :-)
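For the Latin-1 Supplement block, the codepoint-level mapping is a fixed offset, just like ASCII: U+00C0..U+00DE lowercase to the codepoint 0x20 higher, with U+00D7 (the multiplication sign) excluded because it is not a letter. A sketch of the extra check (cp_tolower is a hypothetical helper name, not utf8.h API):

```c
#include <stdint.h>

/* hypothetical helper: lowercase a codepoint, covering ASCII plus the
 * Latin-1 Supplement uppercase letters À..Þ (U+00C0..U+00DE), which all
 * map to lowercase at +0x20; U+00D7 is the multiplication sign, not a
 * letter, so it is skipped */
static int32_t cp_tolower(int32_t cp) {
  if (cp >= 'A' && cp <= 'Z')
    return cp + 0x20;
  if (cp >= 0x00C0 && cp <= 0x00DE && cp != 0x00D7)
    return cp + 0x20;
  return cp;
}
```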

A question on casting

Well, this is not an issue, it is an elementary question/request.

You have said, "..... Having it as a void* forces a user to explicitly cast the utf8 string to char* such that the onus is on them not to break the code anymore!...."

I am not very strong on this.
Can you please give an example of how to do this casting in code, or point me to a page where such example code is given?

Thanks
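For illustration, memchr from the standard library has the same void* return convention as utf8.h's search functions, so it shows the cast without needing the header:

```c
#include <stddef.h>
#include <string.h>

/* search functions that return void* (as utf8.h's do) are cast back to
 * the concrete pointer type at the call site */
static const char *find_byte(const char *text, char wanted) {
  void *r = memchr(text, wanted, strlen(text));
  if (r == NULL)
    return NULL;
  return (const char *)r; /* the explicit cast */
}
```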

Some minor overflow bugs

Hi maintainers,

I did some minor analysis of your library using the KLEE symbolic execution engine, which at its core tries to explore all the different execution paths in the software under analysis to find bugs. It's an academic tool and you can see it here: https://klee.github.io/
During this process I found several overflows in the code, and I wanted to report them collectively; that's the purpose of this issue.

The small example program I used is the following:

#include "utf8.h"


int
main(int argc, char **argv)
{

        char arr[10];
        klee_make_symbolic(arr, 10, "arr");
        klee_assume(arr[9] == '\0');

        char arr1[10];
        klee_make_symbolic(arr1, 10, "arr1");
        klee_assume(arr1[9] =='\0');

        void *arr_check = utf8valid(arr);
        void *arr1_check = utf8valid(arr1);
        if (arr_check != 0 && arr1_check != 0)
        {
                if (utf8ncasecmp(arr_check, arr1_check, 9) == 0)
                        return 1;
                return 0;
        }
        return 1;
}

The calls to klee_make_symbolic trigger KLEE to consider the values in the arr and arr1 buffers to be unknowns in an equation system. It's not really necessary to understand the details of this to understand the bugs I am reporting; the main point is that the bugs essentially set arr and arr1 to certain characters, and then the execution of utf8ncasecmp results in some memory out-of-bounds access. Some of the bugs shown here I have also analysed with AddressSanitizer to confirm them.

If my code snippet above uses your library in an erroneous manner then please disregard the bugs.

Bugs:

Bug 1

arr value: "\xf0\x00\x00\x01\xf0\x00\x01\x01\xf0\x00"
arr1 value: "\xf0\x00\x00\x01\xf0\x00\x01\x01\xff\x00"
Type: memory-out-of-bound:
Stack trace:
#0 utf8codepoint at ./utf8.h:987
#1 utf8ncasecmp at ./utf8.h:507

Bug 2

arr value: "\xf0\x00\x00\x01\xf0\x00\x01\x01\xe0\x00"
arr1 value: "\xf0\x00\x00\x01\xf0\x00\x01\x01\xff\x00"
Type: memory-out-of-bound:
Stack trace:
#0 in utf8codepoint at ./utf8.h:992
#1 in utf8ncasecmp at ./utf8.h:507

Bug 3

arr value: "\xf0\x00\x00\x01\xe0\x00\x01\xf0\xff\x00"
arr1 value: "\xf0\x00\x00\x01\xf0\x00\x00\x01\xff\x00"
Type: memory-out-of-bound:
Stack trace:
#0 in utf8codepoint at ./utf8.h:987
#1 in utf8ncasecmp at ./utf8.h:507

Bug 4

arr value: "\xf0\x00\x00\x01\xf0\x00\x00\x01\xc1\x00"
arr1 value: "\xf0\x00\x00\x01\xf0\x00\x00\x01\xc1\x00"
Type: memory-out-of-bound:
Stack trace:
#0 in utf8codepoint at ./utf8.h:984
#1 in utf8ncasecmp at ./utf8.h:507

Bug 5

arr value: "\xf1\x00\x00\x00Y\xf0\x00\x01\x00\x00"
arr1 value: "\xf1\x00\x00\x00\xc19\xf0\x00\x01\x00"
Type: memory-out-of-bound:
Stack trace:
#0 in utf8ncasecmp at ./utf8.h:494

Bug 6

arr value: "\xe1\x80\x00\xe0\x01\x1b\xf0\x00\x02\x00"
arr1 value: "\xf0\x01\x00\x00\xf0\x00\x01\x1b\xc2\x00"
Type: memory-out-of-bound:
Stack trace:
#0 in utf8ncasecmp at ./utf8.h:494

Bug 7

arr value: "\xf1\x00\x00\x00\xc3 \xc2\x00\x00\x00"
arr1 value: "\xf1\x00\x00\x00\xe0\x03\x00\xe0\x02\x00"
Type: memory-out-of-bound:
Stack trace:
#0 in utf8ncasecmp at ./utf8.h:468

Bug 8

arr value: "\xf1\x00\x00\x00\xe0\x03\x17\xe0\x01\x00"
arr1 value: "\xf1\x80\x00\x00\xc3\x17\xc1\x00\xff\x00"
Type: memory-out-of-bound:
Stack trace:
#0 in utf8ncasecmp at ./utf8.h:481

Bug 9

arr value: "\xf1\x00\x00\x00\xc3 \xc2\x00\xc0\x00"
arr1 value: "\xf1\x00\x00\x00\xe0\x03\x00\xe0\x02\x00"
Type: memory-out-of-bound:
Stack trace:
#0 in utf8ncasecmp at ./utf8.h:470

Bug 10

arr value: "\xf1\x00\x00\x00\xc3\x00\xe0\x01\x00\x00"
arr1 value: "\xf1\x00\x00\x00\xe0\x03 \xe0\x01\x00"
Type: memory-out-of-bound:
Stack trace:
#0 in utf8ncasecmp at ./utf8.h:481

Bug 11

arr value: "\xf1\x00\x00\x00\xc1\x10\xc1\x00\xf0\x00"
arr1 value: "\xf1\x80\x00\x00\xe0\x010\xe0\x01\x00"
Type: memory-out-of-bound:
Stack trace:
#0 in utf8ncasecmp at ./utf8.h:496

Way of removing malloc completely

At the moment, we can pass alloc_func_ptr into functions that need to allocate memory, but the malloc path is still live, and for bare-metal platforms that don't have an actual malloc in the C lib it won't compile. It would be nice if a define could remove the malloc call completely. Happy to fail if there's no malloc; I'll always be passing alloc_func_ptr.

For now I've just defined it out like so
#if UTF8_NO_STD_MALLOC
//No malloc, you must pass in alloc_func_ptr
assert(false);
#else
n = (utf8_int8_t *)malloc(bytes);
#endif

utf8ncpy incorrectly removes last valid codepoint

In the code:

utf8_int8_t *utf8ncpy(utf8_int8_t *utf8_restrict dst,
                      const utf8_int8_t *utf8_restrict src, size_t n) {
 ...
  for (check_index = index - 1;
       check_index > 0 && 0x80 == (0xc0 & d[check_index]); check_index--) {
    /* just moving the index */
  }

For code points above 0x7F, the first byte matches 0xc0 == (0xc0 & b); only the continuation bytes match 0x80 == (0xc0 & b) (https://en.wikipedia.org/wiki/UTF-8).

Consider string °¯\_(ツ)_/¯°

let's add a printout:

     /* just moving the index */
      printf("index:%lu byte:%x cond:%u\n", check_index, (unsigned)(unsigned char)d[check_index],
              0x80 == (0xc0 & d[check_index]));
copy index:0 byte:c2
copy index:1 byte:b0
copy index:2 byte:c2
copy index:3 byte:af
copy index:4 byte:5c
copy index:5 byte:5f
copy index:6 byte:28
copy index:7 byte:e3
copy index:8 byte:83
copy index:9 byte:84
copy index:10 byte:29
copy index:11 byte:5f
copy index:12 byte:2f
copy index:13 byte:c2
copy index:14 byte:af
copy index:15 byte:c2
copy index:16 byte:b0
copy index:17 byte:0
found null
index:16 byte:b0 cond:1
index:15 byte:c2 cond:0

The code following that will then chop off the last valid code point:

  if (check_index < index &&
      (index - check_index) < utf8codepointsize(d[check_index])) { //(17-15)=2 < utf8codepointsize=4
    index = check_index;
  }

FIX

Fix that worked for me: (index - check_index) < utf8codepointcalcsize(&d[check_index]))

The problem with using utf8codepointsize:

utf8_constexpr14_impl size_t utf8codepointsize(utf8_int32_t chr) {
  if (0 == ((utf8_int32_t)0xffffff80 & chr)) {
    return 1;
  } else if (0 == ((utf8_int32_t)0xfffff800 & chr)) {
    return 2;
  } else if (0 == ((utf8_int32_t)0xffff0000 & chr)) {
    return 3;
  } else { /* if (0 == ((int)0xffe00000 & chr)) { */
    return 4;
  }
}

is that the signed byte 0xc2 sign-extends to 0xffffffc2 when converted to utf8_int32_t, so none of the 0xffffxxxx & chr tests yield zero.

Possibility of dual-licensing?

utf8.h is currently published under the Unlicense, placing the work in the public domain. This is great, but there are open questions as to whether this is valid in all jurisdictions (Germany being the most famous example).

As such, would you be at all willing to consider dual-licensing this software under the Unlicense and another "fallback" license? The CC0 license is another public domain license with a clause for what should happen when the terms of the license are deemed invalid under local law. Alternatively, there exist other minimal OSI approved licenses (such as the MIT license, the ISC license and the BSD licenses) which are permissive. These typically require attribution from the user, but if the software were dual-licensed, it would be entirely their choice which license they want to use.

Absolutely no worries if this is too big an ask, just really want to be able to use this software in a more legally-watertight way.

Allow programmer specified allocator

It would be nice to provide a way for the person including the utf8 code to do something like:

#define utf8malloc my_alloc
#include "utf8.h"

I would make a PR to do this, but I don't know if it would be anything anyone is interested in.
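A sketch of how such an override hook typically works (utf8malloc here is the hypothetical name from the snippet above, not an existing utf8.h define):

```c
#include <stdlib.h>

/* the header would default the hook to malloc unless the includer
 * defined it first, e.g. `#define utf8malloc my_alloc` before
 * including utf8.h */
#ifndef utf8malloc
#define utf8malloc malloc
#endif

/* inside the header, allocations would then go through the hook */
static void *allocate_copy_buffer(size_t bytes) {
  return utf8malloc(bytes);
}
```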

grapheme support

Are there any functions like utf8codepoint and utf8rcodepoint, but for graphemes?

utf8valid with size

Is there a way to call utf8valid with string which is not null-terminated?

utf8makevalid read out of bounds (+ other functions)

Hello,
It seems to me that utf8makevalid can read the string to modify out of bounds:

while ('\0' != *read) {
  if (0xf0 == (0xf8 & *read)) {
    /* ensure each of the 3 following bytes in this 4-byte
     * utf8 codepoint began with 0b10xxxxxx */
    if ((0x80 != (0xc0 & read[1])) || (0x80 != (0xc0 & read[2])) ||
        (0x80 != (0xc0 & read[3]))) {

=> it seems to me that we cannot be sure that read[1], [2] and [3] are not out of bounds.

Regards,

PS: the same problem exists in utf8codepoint and maybe other functions, but it is particularly important for utf8makevalid, because I can pass any invalid string as input.

clang-format?

While working on #92, I noticed there was no clang-format file provided with utf8.h.
I tried to respect the formatting, but it's hard to be consistent.
Providing a clang-format file would allow contributors to follow the repo's style guide easily.

utf8rchr issue

Hello, I think I have found a bug in the utf8rchr code. On some occasions it skips past the null terminator of the string and continues reading until it finds the specified character. Running the following code under gcc produced the problem:

#include <stdio.h>
#include "utf8.h"

int main()
{
    char *s1 = "Hello";
    char *s2 = "Hello  ";
    char *result = utf8rchr(s1, 'o');

    printf("String pointer: %p\n", (void *)s1);
    printf("Char pointer  : %p\n", (void *)result);
    printf("Index         : %d\n", (int)(result - s1));

    return 0;
}

This code produced the following output, indicating that the last occurrence of 'o' was at index 10 (it should be at 4), well past the end of the string.

Output from gcc:

String pointer: 7ff659d368d4
Char pointer  : 7ff659d368de
Index         : 10

Looking at the code for utf8rchr, I believe the problem is where offset is incremented by 2 (for a single-byte ascii character) instead of 1, skipping the null terminator:

    while (src[offset] == c[offset]) {
      offset++;
    }

This doesn't occur on all occasions. I suspect that if the code encounters multiple null characters before it finds another occurrence of the search character, it works okay.

tolower and toupper?

How does this work with utf8?

I can use tolower() and toupper() to get lowercase or uppercase versions of ascii chars, but from what I can see there's nothing in this library for converting codepoints.

utf8makevalid : test to identify sequence length and possible values not sufficient

Hello,

In utf8makevalid, you use the following test to identify a 4-byte sequence:

"if (0xf0 == (0xf8 & *read))"

This is not correct if you assume that any invalid string can be passed as an input parameter, since only a few values in the F0-FF range are valid.

Moreover, for the valid values in the F0-FF range, the possible values for the second byte are not the same. For example, with F0, the valid range for the second byte is 90..BF instead of 80..BF.

Regards

utf8tok and utf8tok_r

I've been playing with adding utf8tok, but the problem with the original strtok is that it is not re-entrant.

I've been looking at how musl implemented strtok_r, and it's relatively simple: here

void *utf8tok_r(void *utf8_restrict str, const void *utf8_restrict sep, void **utf8_restrict ptr) {
  char* s = (char*) str;
  char** p = (char**) ptr;

  if (!s && !(s = *p)) {
    return NULL;
  }

  s += utf8spn(s, sep);
  if (!*s) {
    return *p = 0;
  }

  *p = s + utf8cspn(s, sep);
  if (**p) {
    *(*p)++ = 0;
  } else {
    *p = 0;
  }

  return s;
}

The following is the test I implemented (it fails at the assert for föőf).

UTEST(utf8tok_r, token_walking) {
    char* string = utf8dup("this|aäáé|föőf|that|");
    char* ptr = NULL;

    ASSERT_EQ(0, utf8ncmp(utf8tok_r(string, "|", &ptr), "this", 4));
    ASSERT_EQ(0, utf8ncmp(utf8tok_r(string, "|", &ptr), "aäáé", 4));
    ASSERT_EQ(0, utf8ncmp(utf8tok_r(string, "|", &ptr), "föőf", 4));
    ASSERT_EQ(0, utf8ncmp(utf8tok_r(string, "|", &ptr), "that", 4));

    free(string);
}

After playing with this for a bit, I am kind of at a loss for what to do.

Anyways, leaving this here in case someone else wants to pick it up and go on.

int might be too small for a code point

utf8chr uses an int type, which can be as narrow as 16 bits, for the chr code point argument. The maximum code point is 0x10FFFF, which needs more than 16 bits. To be more portable, the chr argument's type should be something that is always large enough, maybe long if you don't want to include stdint.h.

strn*/utf8n* functions

The utf8len function returns codepoints instead of bytes, as expected, but it seems things like utf8ncmp continue to use bytes, which wasn't what I expected. Perhaps utf8ncmp could use n in codepoints too, and another utf8bcmp could use b bytes?

Not at all critical, since a work-around is easy, and I have no idea if others would want codepoint counting instead of bytes for the n functions. I needed an n codepoint compare, so I noticed this.
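One work-around is to convert a codepoint count into a byte count first and then use the existing byte-wise functions; a self-contained sketch (cp_span is a hypothetical helper, not utf8.h API):

```c
#include <stddef.h>

/* number of bytes spanned by the first n codepoints of s (or fewer, if
 * the string ends first): step over each lead byte, then skip its
 * continuation bytes (0b10xxxxxx) */
static size_t cp_span(const char *s, size_t n) {
  size_t i = 0;
  while (n > 0 && s[i] != '\0') {
    i++; /* consume the lead byte */
    while ((0xc0 & (unsigned char)s[i]) == 0x80)
      i++; /* consume continuation bytes */
    n--;
  }
  return i;
}
```

With something like this, `utf8ncmp(a, b, cp_span(a, n))` approximates an n-codepoint compare, though a robust version would take the spans of both strings into account.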

Why not use memcpy()?

    // copy src byte-by-byte into our new utf8 string
    while ('\0' != s[bytes]) {
      n[bytes] = s[bytes];
      bytes++;
    }
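A single memcpy would indeed do once the byte count is known, since utf8 strings are plain byte strings; a utf8.h-free sketch of the same duplication:

```c
#include <stdlib.h>
#include <string.h>

/* duplicate a null-terminated utf8 string with one memcpy instead of a
 * byte-by-byte loop; strlen's byte count (plus the terminator) is all
 * memcpy needs */
static char *dup_utf8(const char *src) {
  size_t bytes = strlen(src) + 1; /* include the null terminator */
  char *n = (char *)malloc(bytes);
  if (n == NULL)
    return NULL;
  memcpy(n, src, bytes);
  return n;
}
```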

Copy string to limited buffer, without risking invalid result?

It looks like utf8cpy will copy the entire string, but makes an assumption about the destination being big enough, whereas utf8ncpy allows you to specify a destination buffer size limit, but risks creating an invalid result if the source string is longer.

I'm curious when this second result is ever desirable? If I'm working with utf8 strings, and I want to limit a string to a certain buffer size, shouldn't it crop the string at a valid code point?
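Cropping at a valid codepoint only requires backing off over continuation bytes at the cut point; a sketch of a hypothetical bounded copy (utf8trunc_cpy is not a library function):

```c
#include <stddef.h>
#include <string.h>

/* copy at most n-1 bytes of src into dst, truncating at a codepoint
 * boundary and always null-terminating (hypothetical helper) */
static void utf8trunc_cpy(char *dst, const char *src, size_t n) {
  size_t len;
  if (n == 0)
    return;
  len = strlen(src);
  if (len > n - 1) {
    len = n - 1;
    /* if the cut lands inside a codepoint, back off past the
     * continuation bytes (0b10xxxxxx) to the previous boundary */
    while (len > 0 && (0xc0 & (unsigned char)src[len]) == 0x80)
      len--;
  }
  memcpy(dst, src, len);
  dst[len] = '\0';
}
```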

`utf8nvalid` reads out of bounds

The utf8nvalid procedure fails to respect the n parameter when the string ends in a multibyte codepoint. In those cases it will read past the end when ensuring the codepoint is terminated; the bounds check does not cover the later str[2] access:

      /* ensure that there's 2 bytes or more remained */
      if (remained < 2) {
        return (utf8_int8_t *)str;
      }

      /* ensure the 1 following byte in this 2-byte
       * utf8 codepoint began with 0b10xxxxxx */
      if (0x80 != (0xc0 & str[1])) {
        return (utf8_int8_t *)str;
      }

      /* ensure that our utf8 codepoint ended after 2 bytes */
      if (0x80 == (0xc0 & str[2])) {
        return (utf8_int8_t *)str;
      }

This fails in cases such as the following, where a string is unterminated:

#include <assert.h>
#include <string.h>
#include "utf8.h"

int main(int argc, char** argv) {
    const char terminated[] = "\xc2\xa3"; // UTF-8 encoding of U+00A3 (pound sign)
    size_t terminated_length = strlen(terminated);

    const char memory[] = "\xff\xff\xff\xff"
                          "\xc2\xa3"
                          "\x80\xff\xff\xff";

    const char* unterminated_begin = &memory[4];
    const char* unterminated_end = &memory[strlen(memory) - 4];
    size_t unterminated_length = unterminated_end - unterminated_begin;

    assert(terminated_length == unterminated_length);
    assert(strncmp(terminated, unterminated_begin, unterminated_length) == 0);
    // The two strings are identical within the bounds that are passed to
    // utf8nvalid, so we would expect these two tests to pass.
    assert(utf8nvalid(terminated, terminated_length) == NULL);
    assert(utf8nvalid(unterminated_begin, unterminated_length) == NULL); // fails!
}

Request for utf8makevalid() function in addition to utf8valid()

A common use case is that an application has to somehow work with the string provided even if it may have invalid sequences. This function would replace invalid utf8 sequences in a string with the specified valid utf8 character byte, ensuring that the output is valid utf8 and has the same total byte length as the input.
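A much-simplified sketch of what such a function could do (makevalid_sketch is hypothetical; it only checks lead/continuation patterns and ignores overlong encodings and surrogates):

```c
/* simplified sketch of the requested utf8makevalid: overwrite any byte
 * that does not fit a well-formed lead/continuation pattern with the
 * replacement byte, keeping the total byte length unchanged */
static void makevalid_sketch(char *s, char replacement) {
  unsigned char *p = (unsigned char *)s;
  while (*p) {
    int len, i;
    if (*p < 0x80)
      len = 1;
    else if ((*p & 0xe0) == 0xc0)
      len = 2;
    else if ((*p & 0xf0) == 0xe0)
      len = 3;
    else if ((*p & 0xf8) == 0xf0)
      len = 4;
    else { /* stray continuation byte or invalid lead */
      *p++ = (unsigned char)replacement;
      continue;
    }
    for (i = 1; i < len; i++) {
      if ((p[i] & 0xc0) != 0x80)
        break; /* missing or bad continuation (also catches '\0') */
    }
    if (i == len)
      p += len; /* well-formed sequence, keep it */
    else
      *p++ = (unsigned char)replacement; /* flag the bad lead byte */
  }
}
```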

utf8ncpy incorrectly loops when it does not hit null-terminator

Following up on my issue from #50, utf8ncpy doesn't (now) correctly stop at n bytes unless it hits the null-terminator in src. The following code demonstrates the problem:

#include "utf8.h"

int main(int argc, char* argv[]) {
  char buffer[10];
  utf8ncpy(buffer, "foo", 2);
}

Running this program results in a segmentation fault for me, due to n wrapping around past 0.

Changing the 2 to 3 works, because the null-terminator is hit at the end of the string "foo".

conflicting int32_t definition

utf8.h detects _MSC_VER and #defines int32_t. The problem is that I've got another header that uses int32_t in a typedef, which ends up creating this:

typedef __int32 __int32;

which gives me a compiler error. Since stdint.h is included in Visual Studio 2010 and up, I propose changing the check for _MSC_VER to something like this:

#if (_MSC_VER < 1600)
#define int32_t __int32
#define uint32_t unsigned __int32
#else
#include <stdint.h>
#endif

That way if we have stdint.h with visual studio we don't have to resort to #defines.

Bug in utf8len on malformed UTF-8 string

A malformed UTF-8 string will cause utf8len to read memory past str's null character. For example, a buffer like this:

char str[] = { -16, '\0' };

The function needs additional checks for a null character appearing somewhere unexpected, and should then report an error in str. Maybe it needs an error argument, should set errno, or should return something else.

Invalid pointer returned when calling utf8codepoint function for a empty string

Sample code to reproduce the issue

const char *emptystr = u8"";
utf8_int32_t c;
void *ret = utf8codepoint((void *)emptystr, &c);

It is expected to return (void *)emptystr, but it returns (void *)(emptystr + 1).
ret is now a bad pointer: it points to the address after the null terminator!

I suggest adding a null check at the beginning of the function, see below.


void *utf8codepoint(const void *utf8_restrict str,  utf8_int32_t *utf8_restrict out_codepoint) {
  const char *s = (const char *)str;

  // make sure an empty string will always return a fixed result: the pointer to str itself;
  // without this check it could return an invalid position (s + x), which can cause memory issues
  if ('\0' == *s) {
    return (void *)s;
  }

...

  return (void *)s;
}

More readable/maintainable tests

Hi

I was just looking through tests/main.c and noticed that all the error codes are hard-coded into the test functions themselves.

I'm not sure what the reason is for this, but my initial thought was that it would be better to have a header file define an enum that could contain descriptive error codes for all the functions.

Tedious and boring work, I know, but it would make it easier to add new error codes and to reason about each test, assuming this wouldn't break something that relies on knowing the codes without being able to read an enum...

How to get codepoint of first character in iteration?

This code is following the example of the PR #21 to iterate. However, utf8codepoint is for getting the pointer and the codepoint of the next character. How can I get the codepoint of the first character?

utf8_char = utf8codepoint(utf8_string, &codepoint);
while (codepoint != '\0') {
	this_char = malloc(utf8codepointsize(codepoint) + 1);
	memset(this_char, 0, utf8codepointsize(codepoint) + 1);
	memcpy(this_char, utf8_char, utf8codepointsize(codepoint));
	printf("This char: %s\n", this_char);
	utf8_char = utf8codepoint(utf8_char, &codepoint);
}

Bug with utf8casecmp?

It seems utf8casecmp is not working correctly. I was trying to use it with std::set as a custom comparator. I compared it to strcasecmp and found it does not give the same results for basic ASCII strings. Note I wouldn't expect the same values, but I would expect the signs to match.

    printf("%d\n", strcasecmp(".gdoc", ".GSHeeT")); // -15
    printf("%d\n", utf8casecmp(".gdoc", ".GSHeeT")); // 1
    printf("%d\n", strcasecmp(".gsheet", ".gSLiDe")); // -4
    printf("%d\n", utf8casecmp(".gsheet", ".gSLiDe")); // 1
