
utf8.h's Issues

provide a function to get the previous codepoint

As far as I can tell, the library currently only provides a way to iterate over a byte sequence in one direction, using utf8codepoint. It would be super useful to have a function to go the other way too, and possibly rename them to utf8next and utf8prev or something similar. I'd be happy to add this if it's something you'd merge!

Thanks for the library, it's very clean, lightweight and useful!
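Stepping backwards only requires skipping continuation bytes (those matching 0b10xxxxxx), since a codepoint's lead byte is the first non-continuation byte before the current position. A self-contained sketch of a hypothetical utf8prev (not an existing library function):

```c
/* hypothetical utf8prev: given the start of the string and a pointer to
 * the current codepoint, step back to the start of the previous codepoint
 * by skipping continuation bytes (0b10xxxxxx). Sketch only. */
static const char *utf8prev(const char *begin, const char *s) {
  if (s == begin)
    return begin; /* nothing before the start */
  do {
    s--;
  } while (s > begin && (0xc0 & (unsigned char)*s) == 0x80);
  return s;
}
```

On the bytes `61 c3 a9` ("aé"), stepping back from the terminator lands on the `c3` lead byte, not the `a9` continuation byte.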

utf8upr/lwr size issues?

Hi, I was looking at the docs for utf8upr/lwr, and they don't seem to indicate what happens if the string passed to them doesn't have enough space for the new codepoints. I understand that letters may have different byte sizes in their upper/lowercase variants, so I was wondering whether utf8upr/lwr will allocate extra memory as required.

Looking at the code, though, it seems like they just call utf8catcodepoint, which AFAIK doesn't allocate additional memory. In fact, the size argument in that call is set to the size of the new codepoint, rather than the size of the buffer as it should be. Is this correct?

Support constexpr?

Hey, thanks for this library. I'd like to be able to use it in a constexpr context.
Would you be willing to add a utf8_constexpr macro that would be used in contexts that allow it?
I might be able to work on a PR, but I don't have a time estimate.
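A sketch of the idea (utf8_constexpr is the hypothetical macro name, not an existing part of utf8.h): expand to constexpr only where the compiler supports C++14 relaxed constexpr, and to nothing otherwise, so the header stays valid C:

```c
/* hypothetical utf8_constexpr: expands to constexpr only where the
 * compiler supports C++14 relaxed constexpr, otherwise to nothing */
#if defined(__cplusplus) && (__cplusplus >= 201402L)
#define utf8_constexpr constexpr
#else
#define utf8_constexpr
#endif

/* a function declared this way stays usable from C and pre-C++14 C++ */
utf8_constexpr int utf8_example_len_ascii(const char *s) {
  int n = 0;
  while (s[n] != '\0')
    n++;
  return n;
}
```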

Couple of thoughts

Hey, nice library! I am looking for utf-8 C string parsing and this fits the bill. I had a couple of thoughts after reading the code.

  • While you have dutifully re-implemented the string.h functions, some of them are considered harmful. For example, utf8ncpy may not append a null terminating character. If you don't consider it too sanctimonious, supplying non-harmful functions and making people opt in to the riskier ones (with an ifdef?) might be helpful. I suspect so, since safety is clearly a concern of yours, given that you return void*.

For instance, a safer utf8ncpy that guarantees a null terminator (possibly truncating the last character) and returns a boolean indicating whether the string was truncated could be helpful, even if it doesn't conform to anything in string.h. It should also fill in only one null character at the end, because zeroing everything after the terminator is a waste of cycles. I have been using such a workhorse function for years.

  • Consider using restrict where available. This will shrink code size and reduce waits on memory accesses. Again, a non-conforming change for some applications, since applying it requires parameters not to overlap, but that's going to be the case in real-world situations anyway.
  • I would be happy to trap strange input parameters by defining an assert macro before including your header. For instance (and sorry to pick on utf8cpy), if the maximum number of elements is zero, you could check that in an assert which is a no-op unless defined by the caller prior to header inclusion.

Here is a snippet that lets you portably apply the restrict keyword if you're interested:

#if defined(__GNUC__) || defined(__clang__)
    #ifdef __cplusplus
        #define utf8_restrict __restrict
    #else
        #if defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 199901L)
            #define utf8_restrict restrict
        #else
            #define utf8_restrict __restrict
        #endif
    #endif
#elif defined(_MSC_VER) && (_MSC_VER >= 1400) /* vs2005 */
    #define utf8_restrict __restrict
#else
    #define utf8_restrict
#endif

Not an issue

How do I loop through a utf8 string, along the lines of:

for (int i = 0; i < strlen(str); i++) do_something_with(str[i]);

Character iterating?

What do you suggest is the best way of iterating over codepoints using this library?
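With the library itself, the usual idiom is `for (s = utf8codepoint(str, &cp); cp; s = utf8codepoint(s, &cp))`. The sketch below uses a minimal stand-in decoder so it compiles without utf8.h; it assumes well-formed input and is not the library's implementation:

```c
#include <stddef.h>
#include <stdint.h>

/* minimal stand-in for utf8.h's utf8codepoint: decode the codepoint at s
 * into *out and return a pointer to the start of the next codepoint.
 * Sketch only: assumes well-formed input, no validation. */
static const char *next_codepoint(const char *s, int32_t *out) {
  unsigned char c = (unsigned char)*s;
  if (c < 0x80) { /* 1-byte (ASCII) */
    *out = c;
    return s + 1;
  }
  if (c < 0xe0) { /* 2-byte sequence */
    *out = ((c & 0x1f) << 6) | (s[1] & 0x3f);
    return s + 2;
  }
  if (c < 0xf0) { /* 3-byte sequence */
    *out = ((c & 0x0f) << 12) | ((s[1] & 0x3f) << 6) | (s[2] & 0x3f);
    return s + 3;
  }
  /* 4-byte sequence */
  *out = ((c & 0x07) << 18) | ((s[1] & 0x3f) << 12) |
         ((s[2] & 0x3f) << 6) | (s[3] & 0x3f);
  return s + 4;
}

/* walk a whole string codepoint by codepoint */
static size_t count_codepoints(const char *s) {
  int32_t cp;
  size_t n = 0;
  for (s = next_codepoint(s, &cp); cp != 0; s = next_codepoint(s, &cp))
    n++;
  return n;
}
```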

[feature] Add utf8ndup

This is my current implementation. I am using it to replace all of my strndup calls.

#include <utf8.h>

void*
utf8ndup(const void* src, size_t n)
{
    const char* s = (const char*)src;
    char* c       = 0;
    size_t bytes  = 0;

    // allocate n bytes plus one for the null terminator
    c = (char*)malloc(n + 1);

    if (0 == c) {
        // out of memory so we bail
        return 0;
    }

    // copy at most n bytes of src byte-by-byte into our new utf8 string
    while ('\0' != s[bytes] && bytes < n) {
        c[bytes] = s[bytes];
        bytes++;
    }

    // append null terminating byte
    c[bytes] = '\0';
    return c;
}

I don't know if this is desirable. I'm half tempted to just calloc and memcpy the results instead.

utf8ncpy writes n+1 bytes (buffer overflow)

Here is an example test case where I tell utf8ncpy to write at most 10 bytes, but it results in all 11 bytes of the buffer being written. I first noticed this in a larger program when it triggered a stack check exception due to buffer overflow.

#include <string.h>
#include <stdio.h>
#include "utf8.h"

int main(int argc, char* argv[]) {
  char buffer[11];
  memset(buffer, 0xdd, 11);
  printf("%02x\n", buffer[10] & 0xff);

  utf8ncpy(buffer, "foo", 10);

  printf("%02x\n", buffer[10] & 0xff);
}

which I have compiled simply with clang main.c, with clang version: Apple LLVM version 9.1.0 (clang-902.0.39.2)

I get the result of

dd
00

when I would expect

dd
dd

Bug in utf8valid

utf8valid will fail on a utf8 file with 2 or more line breaks in a row.

utf8ncat - size wraparound bug

Hello 👋! I think I found a small bug in utf8ncat when the function is executed with size_t n being 0.
The function will still write all remaining bytes to the dst buffer.

for example:

utf8_int8_t dst[12] = { 'h', 'e', 'l', 'l', 'o', '\0' };
const utf8_int8_t* src = "world";
utf8ncat(dst, src, 0);

// dst will be { 'h', 'e', 'l', 'l', 'o', 'w', 'o', 'r', 'l', 'd', '\0', '\0' };

If I am not mistaken, it is because size_t is unsigned, which causes the following --n to wrap around:

utf8.h/utf8.h

Line 631 in 89f6a43

} while (('\0' != *src) && (0 != --n));

I presume this is not the intended behavior and that this is a bug.
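One way to avoid the wraparound is to test the bound before consuming it; a self-contained sketch of the idea (this is not the library's actual fix):

```c
#include <stddef.h>
#include <string.h>

/* sketch of an n-bounded append where n == 0 means "append nothing";
 * the bound is tested before any byte is consumed, so an unsigned n
 * can never wrap around */
static void bounded_append(char *dst, const char *src, size_t n) {
  size_t i = strlen(dst);
  size_t j = 0;
  while (j < n && src[j] != '\0') { /* test before consuming the bound */
    dst[i++] = src[j++];
  }
  dst[i] = '\0';
}
```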

utf8lwr() - handle accented upper case vowels

Currently utf8.h's lowercase conversion only handles ASCII chars. It is not possible to handle accented uppercase and lowercase vowels like Á, for example. For my specific purpose I would like to cover at least ÀÈÌÒÙ, since they cover most Latin languages. For the time being I would also be OK with monkey-patching utf8.h myself. I suppose the change has to take place here:

utf8.h/utf8.h

Line 1016 in 1ca34ec

if (('A' <= cp) && ('Z' >= cp)) {

Any advice on how to do it? (I'm not a great C coder unfortunately :-)
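For the Latin-1 Supplement block, the codepoint-level mapping is a fixed offset, just like ASCII: U+00C0..U+00DE lowercase to the codepoint 0x20 higher, with U+00D7 (the multiplication sign) excluded because it is not a letter. A sketch of the extra check (cp_tolower is a hypothetical helper name, not utf8.h API):

```c
#include <stdint.h>

/* hypothetical helper: lowercase a codepoint, covering ASCII plus the
 * Latin-1 Supplement uppercase letters À..Þ (U+00C0..U+00DE), which all
 * map to lowercase at +0x20; U+00D7 is the multiplication sign, not a
 * letter, so it is skipped */
static int32_t cp_tolower(int32_t cp) {
  if (cp >= 'A' && cp <= 'Z')
    return cp + 0x20;
  if (cp >= 0x00C0 && cp <= 0x00DE && cp != 0x00D7)
    return cp + 0x20;
  return cp;
}
```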

A question on casting

Well, this is not an issue, it is an elementary question/request.

You have said, "..... Having it as a void* forces a user to explicitly cast the utf8 string to char* such that the onus is on them not to break the code anymore!...."

I am not very strong on this.
Can you please give an example of how to do this casting in code, or point me to a page where such example code is given?

Thanks
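For illustration, memchr from the standard library has the same void* return convention as utf8.h's search functions, so it shows the cast without needing the header:

```c
#include <stddef.h>
#include <string.h>

/* search functions that return void* (as utf8.h's do) are cast back to
 * the concrete pointer type at the call site */
static const char *find_byte(const char *text, char wanted) {
  void *r = memchr(text, wanted, strlen(text));
  if (r == NULL)
    return NULL;
  return (const char *)r; /* the explicit cast */
}
```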

Some minor overflow bugs

Hi maintainers,

I did some minor analysis of your library using the KLEE symbolic execution engine, which at its core tries to explore all the different execution paths in the software under analysis to find bugs. It's an academic tool and you can see it here: https://klee.github.io/
During this process I found several overflows in the code, and I wanted to report them collectively; that's the purpose of this issue.

The small example program I used is the following:

#include "utf8.h"


int
main(int argc, char **argv)
{

        char arr[10];
        klee_make_symbolic(arr, 10, "arr");
        klee_assume(arr[9] == '\0');

        char arr1[10];
        klee_make_symbolic(arr1, 10, "arr1");
        klee_assume(arr1[9] =='\0');

        void *arr_check = utf8valid(arr);
        void *arr1_check = utf8valid(arr1);
        if (arr_check != 0 && arr1_check != 0)
        {
                if (utf8ncasecmp(arr_check, arr1_check, 9) == 0)
                        return 1;
                return 0;
        }
        return 1;
}

The calls to klee_make_symbolic trigger KLEE to consider the values in the arr and arr1 buffers to be unknowns in an equation system. It's not really necessary to understand the details of this to understand the bugs I am reporting; the main point is that the bugs essentially set arr and arr1 to certain characters, and then the execution of utf8ncasecmp results in some memory out-of-bounds access. Some of the bugs shown here I have also analysed with AddressSanitizer to confirm them.

If my code snippet above uses your library in an erroneous manner then please disregard the bugs.

Bugs:

Bug 1

arr value: "\xf0\x00\x00\x01\xf0\x00\x01\x01\xf0\x00"
arr1 value: "\xf0\x00\x00\x01\xf0\x00\x01\x01\xff\x00"
Type: memory-out-of-bound:
Stack trace:
#0 utf8codepoint at ./utf8.h:987
#1 utf8ncasecmp at ./utf8.h:507

Bug 2

arr value: "\xf0\x00\x00\x01\xf0\x00\x01\x01\xe0\x00"
arr1 value: "\xf0\x00\x00\x01\xf0\x00\x01\x01\xff\x00"
Type: memory-out-of-bound:
Stack trace:
#0 in utf8codepoint at ./utf8.h:992
#1 in utf8ncasecmp at ./utf8.h:507

Bug 3

arr value: "\xf0\x00\x00\x01\xe0\x00\x01\xf0\xff\x00"
arr1 value: "\xf0\x00\x00\x01\xf0\x00\x00\x01\xff\x00"
Type: memory-out-of-bound:
Stack trace:
#0 in utf8codepoint at ./utf8.h:987
#1 in utf8ncasecmp at ./utf8.h:507

Bug 4

arr value: "\xf0\x00\x00\x01\xf0\x00\x00\x01\xc1\x00"
arr1 value: "\xf0\x00\x00\x01\xf0\x00\x00\x01\xc1\x00"
Type: memory-out-of-bound:
Stack trace:
#0 in utf8codepoint at ./utf8.h:984
#1 in utf8ncasecmp at ./utf8.h:507

Bug 5

arr value: "\xf1\x00\x00\x00Y\xf0\x00\x01\x00\x00"
arr1 value: "\xf1\x00\x00\x00\xc19\xf0\x00\x01\x00"
Type: memory-out-of-bound:
Stack trace:
#0 in utf8ncasecmp at ./utf8.h:494

Bug 6

arr value: "\xe1\x80\x00\xe0\x01\x1b\xf0\x00\x02\x00"
arr1 value: "\xf0\x01\x00\x00\xf0\x00\x01\x1b\xc2\x00"
Type: memory-out-of-bound:
Stack trace:
#0 in utf8ncasecmp at ./utf8.h:494

Bug 7

arr value: "\xf1\x00\x00\x00\xc3 \xc2\x00\x00\x00"
arr1 value: "\xf1\x00\x00\x00\xe0\x03\x00\xe0\x02\x00"
Type: memory-out-of-bound:
Stack trace:
#0 in utf8ncasecmp at ./utf8.h:468

Bug 8

arr value: "\xf1\x00\x00\x00\xe0\x03\x17\xe0\x01\x00"
arr1 value: "\xf1\x80\x00\x00\xc3\x17\xc1\x00\xff\x00"
Type: memory-out-of-bound:
Stack trace:
#0 in utf8ncasecmp at ./utf8.h:481

Bug 9

arr value: "\xf1\x00\x00\x00\xc3 \xc2\x00\xc0\x00"
arr1 value: "\xf1\x00\x00\x00\xe0\x03\x00\xe0\x02\x00"
Type: memory-out-of-bound:
Stack trace:
#0 in utf8ncasecmp at ./utf8.h:470

Bug 10

arr value: "\xf1\x00\x00\x00\xc3\x00\xe0\x01\x00\x00"
arr1 value: "\xf1\x00\x00\x00\xe0\x03 \xe0\x01\x00"
Type: memory-out-of-bound:
Stack trace:
#0 in utf8ncasecmp at ./utf8.h:481

Bug 11

arr value: "\xf1\x00\x00\x00\xc1\x10\xc1\x00\xf0\x00"
arr1 value: "\xf1\x80\x00\x00\xe0\x010\xe0\x01\x00"
Type: memory-out-of-bound:
Stack trace:
#0 in utf8ncasecmp at ./utf8.h:496

Way of removing malloc completely

At the moment, we can pass alloc_func_ptr into functions that need to allocate memory, but the malloc path is still live, and for bare-metal platforms that don't have an actual malloc in the C lib it won't compile. It would be nice if a define could remove the malloc call completely. Happy to fail if there's no malloc; I'll always be passing alloc_func_ptr.

For now I've just defined it out like so
#if UTF8_NO_STD_MALLOC
//No malloc, you must pass in alloc_func_ptr
assert(false);
#else
n = (utf8_int8_t *)malloc(bytes);
#endif

utf8ncpy incorrectly removes last valid codepoint

In the code:

utf8_int8_t *utf8ncpy(utf8_int8_t *utf8_restrict dst,
                      const utf8_int8_t *utf8_restrict src, size_t n) {
 ...
  for (check_index = index - 1;
       check_index > 0 && 0x80 == (0xc0 & d[check_index]); check_index--) {
    /* just moving the index */
  }

For code points above 0x7F, the first byte matches 0xc0 == (0xc0 & b); only the continuation bytes match 0x80 == (0xc0 & b) (https://en.wikipedia.org/wiki/UTF-8).

Consider string °¯\_(ツ)_/¯°

let's add a printout:

     /* just moving the index */
      printf("index:%lu byte:%x cond:%u\n", check_index, (unsigned)(unsigned char)d[check_index],
              0x80 == (0xc0 & d[check_index]));
copy index:0 byte:c2
copy index:1 byte:b0
copy index:2 byte:c2
copy index:3 byte:af
copy index:4 byte:5c
copy index:5 byte:5f
copy index:6 byte:28
copy index:7 byte:e3
copy index:8 byte:83
copy index:9 byte:84
copy index:10 byte:29
copy index:11 byte:5f
copy index:12 byte:2f
copy index:13 byte:c2
copy index:14 byte:af
copy index:15 byte:c2
copy index:16 byte:b0
copy index:17 byte:0
found null
index:16 byte:b0 cond:1
index:15 byte:c2 cond:0

The code following that will then chop off the last valid code point:

  if (check_index < index &&
      (index - check_index) < utf8codepointsize(d[check_index])) { //(17-15)=2 < utf8codepointsize=4
    index = check_index;
  }

FIX

Fix that worked for me: (index - check_index) < utf8codepointcalcsize(&d[check_index]))

The problem with using utf8codepointsize:

utf8_constexpr14_impl size_t utf8codepointsize(utf8_int32_t chr) {
  if (0 == ((utf8_int32_t)0xffffff80 & chr)) {
    return 1;
  } else if (0 == ((utf8_int32_t)0xfffff800 & chr)) {
    return 2;
  } else if (0 == ((utf8_int32_t)0xffff0000 & chr)) {
    return 3;
  } else { /* if (0 == ((int)0xffe00000 & chr)) { */
    return 4;
  }
}

is that the signed byte 0xc2 sign-extends to 0xffffffc2 when converted to utf8_int32_t, so none of the 0xffffxxxx & chr tests yield zero.

Possibility of dual-licensing?

utf8.h is currently published under the Unlicense, placing the work in the public domain. This is great, but there are open questions as to whether this is valid in all jurisdictions (Germany being the most famous example).

As such, would you be at all willing to consider dual-licensing this software under the Unlicense and another "fallback" license? The CC0 license is another public domain license with a clause for what should happen when the terms of the license are deemed invalid under local law. Alternatively, there exist other minimal OSI approved licenses (such as the MIT license, the ISC license and the BSD licenses) which are permissive. These typically require attribution from the user, but if the software were dual-licensed, it would be entirely their choice which license they want to use.

Absolutely no worries if this is too big an ask, just really want to be able to use this software in a more legally-watertight way.

Allow programmer specified allocator

It would be nice to provide a way for the person including the utf8 code to do something like:

#define utf8malloc my_alloc
#include "utf8.h"

I would make a PR to do this, but I don't know if it would be anything anyone is interested in.
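A sketch of how such an override hook typically works (utf8malloc here is the hypothetical name from the snippet above, not an existing utf8.h define):

```c
#include <stdlib.h>

/* the header would default the hook to malloc unless the includer
 * defined it first, e.g. `#define utf8malloc my_alloc` before
 * including utf8.h */
#ifndef utf8malloc
#define utf8malloc malloc
#endif

/* inside the header, allocations would then go through the hook */
static void *allocate_copy_buffer(size_t bytes) {
  return utf8malloc(bytes);
}
```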

grapheme support

Are there any functions like utf8codepoint and utf8rcodepoint, but for graphemes?

utf8valid with size

Is there a way to call utf8valid with string which is not null-terminated?

utf8makevalid read out of bounds (+ other functions)

Hello,
It seems to me that utf8makevalid can read the string to modify out of bounds:

while ('\0' != *read) {
  if (0xf0 == (0xf8 & *read)) {
    /* ensure each of the 3 following bytes in this 4-byte
     * utf8 codepoint began with 0b10xxxxxx */
    if ((0x80 != (0xc0 & read[1])) || (0x80 != (0xc0 & read[2])) ||
        (0x80 != (0xc0 & read[3]))) {

=> it seems to me that we cannot be sure that read[1], [2] and [3] are not out of bounds.

Regards,

PS: the same problem exists in utf8codepoint and maybe other functions, but it is particularly important for utf8makevalid, because I can pass any invalid string as input.

clang-format?

While working on #92, I noticed there was no clang-format file provided with utf8.h.
I tried to respect the formatting, but it's hard to be consistent.
Providing a clang-format file would allow contributors to follow the repo's style guide easily.

utf8rchr issue

Hello, I think I have found a bug in the utf8rchr code. On some occasions it skips past the null terminator of the string and continues reading until it finds the specified character. Running the following code under gcc produced the problem:

#include <stdio.h>
#include "utf8.h"

int main()
{
    char *s1 = "Hello";
    char *s2 = "Hello  ";
    char *result = utf8rchr(s1, 'o');

    printf("String pointer: %p\n", (void *)s1);
    printf("Char pointer  : %p\n", (void *)result);
    printf("Index         : %d\n", (int)(result - s1));

    return 0;
}

This code produced the following output, indicating that the last occurrence of 'o' was at index 10 (it should be at 4), well past the end of the string.

Output from gcc:

String pointer: 7ff659d368d4
Char pointer  : 7ff659d368de
Index         : 10

Looking at the code for utf8rchr, I believe the problem is where offset is incremented by 2 (for a single-byte ascii character) instead of 1, skipping the null terminator:

    while (src[offset] == c[offset]) {
      offset++;
    }

This doesn't occur on all occasions. I suspect that if the code encounters multiple null characters before it finds another occurrence of the search character, it works okay.

tolower and toupper?

How does this work with utf8?

I can use tolower() and toupper() to get lowercase or uppercase versions of ascii chars, but from what I can see there's nothing in this library for converting codepoints.

utf8makevalid : test to identify sequence length and possible values not sufficient

Hello,

In utf8makevalid, you use the following test to identify a 4-byte sequence:

"if (0xf0 == (0xf8 & *read))"

This is not correct if you assume that any invalid string can be passed as an input parameter, since only a few values in the F0-FF range are valid.

Moreover, for the valid values in the F0-FF range, the possible values for the second byte are not the same. For example, with F0, the valid range for the second byte is 90..BF instead of 80..BF.

Regards

utf8tok and utf8tok_r

I've been playing with adding utf8tok, but the problem with the original strtok is that it is not re-entrant.

I've been looking at how musl implemented strtok_r, and it's relatively simple: here

void *utf8tok_r(void *utf8_restrict str, const void *utf8_restrict sep, void **utf8_restrict ptr) {
  char* s = (char*) str;
  char** p = (char**) ptr;

  if (!s && !(s = *p)) {
    return NULL;
  }

  s += utf8spn(s, sep);
  if (!*s) {
    return *p = 0;
  }

  *p = s + utf8cspn(s, sep);
  if (**p) {
    *(*p)++ = 0;
  } else {
    *p = 0;
  }

  return s;
}

The following is the test I implemented (it fails at the assert for föőf).

UTEST(utf8tok_r, token_walking) {
    char* string = utf8dup("this|aäáé|föőf|that|");
    char* ptr = NULL;

    ASSERT_EQ(0, utf8ncmp(utf8tok_r(string, "|", &ptr), "this", 4));
    ASSERT_EQ(0, utf8ncmp(utf8tok_r(string, "|", &ptr), "aäáé", 4));
    ASSERT_EQ(0, utf8ncmp(utf8tok_r(string, "|", &ptr), "föőf", 4));
    ASSERT_EQ(0, utf8ncmp(utf8tok_r(string, "|", &ptr), "that", 4));

    free(string);
}

After playing with this for a bit, I am kind of at a loss for what to do.

Anyways, leaving this here in case someone else wants to pick it up and go on.

int might be too small for a code point

utf8chr uses an int type, which can be as narrow as 16 bits, for the chr code point argument. The maximum code point is 0x10FFFF, which needs more than 16 bits. To be more portable, the chr argument's type should be something that is always large enough, maybe long if you don't want to include stdint.h.

strn*/utf8n* functions

The utf8len function returns codepoints instead of bytes, as expected, but it seems things like utf8ncmp continue to use bytes, which wasn't what I expected. Perhaps utf8ncmp could use n in codepoints too, and another utf8bcmp could use b bytes?

Not at all critical, since a work-around is easy, and I have no idea if others would want codepoint counting instead of bytes for the n functions. I needed an n codepoint compare, so I noticed this.
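One work-around is to convert a codepoint count into a byte count first and then use the existing byte-wise functions; a self-contained sketch (cp_span is a hypothetical helper, not utf8.h API):

```c
#include <stddef.h>

/* number of bytes spanned by the first n codepoints of s (or fewer, if
 * the string ends first): step over each lead byte, then skip its
 * continuation bytes (0b10xxxxxx) */
static size_t cp_span(const char *s, size_t n) {
  size_t i = 0;
  while (n > 0 && s[i] != '\0') {
    i++; /* consume the lead byte */
    while ((0xc0 & (unsigned char)s[i]) == 0x80)
      i++; /* consume continuation bytes */
    n--;
  }
  return i;
}
```

With something like this, `utf8ncmp(a, b, cp_span(a, n))` approximates an n-codepoint compare, though a robust version would take the spans of both strings into account.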

Why not use memcpy()?

    // copy src byte-by-byte into our new utf8 string
    while ('\0' != s[bytes]) {
      n[bytes] = s[bytes];
      bytes++;
    }
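A single memcpy would indeed do once the byte count is known, since utf8 strings are plain byte strings; a utf8.h-free sketch of the same duplication:

```c
#include <stdlib.h>
#include <string.h>

/* duplicate a null-terminated utf8 string with one memcpy instead of a
 * byte-by-byte loop; strlen's byte count (plus the terminator) is all
 * memcpy needs */
static char *dup_utf8(const char *src) {
  size_t bytes = strlen(src) + 1; /* include the null terminator */
  char *n = (char *)malloc(bytes);
  if (n == NULL)
    return NULL;
  memcpy(n, src, bytes);
  return n;
}
```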

Copy string to limited buffer, without risking invalid result?

It looks like utf8cpy will copy the entire string, but makes an assumption about the destination being big enough, whereas utf8ncpy allows you to specify a destination buffer size limit, but risks creating an invalid result if the source string is longer.

I'm curious when this second result is ever desirable? If I'm working with utf8 strings, and I want to limit a string to a certain buffer size, shouldn't it crop the string at a valid code point?
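Cropping at a valid codepoint only requires backing off over continuation bytes at the cut point; a sketch of a hypothetical bounded copy (utf8trunc_cpy is not a library function):

```c
#include <stddef.h>
#include <string.h>

/* copy at most n-1 bytes of src into dst, truncating at a codepoint
 * boundary and always null-terminating (hypothetical helper) */
static void utf8trunc_cpy(char *dst, const char *src, size_t n) {
  size_t len;
  if (n == 0)
    return;
  len = strlen(src);
  if (len > n - 1) {
    len = n - 1;
    /* if the cut lands inside a codepoint, back off past the
     * continuation bytes (0b10xxxxxx) to the previous boundary */
    while (len > 0 && (0xc0 & (unsigned char)src[len]) == 0x80)
      len--;
  }
  memcpy(dst, src, len);
  dst[len] = '\0';
}
```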

`utf8nvalid` reads out of bounds

The utf8nvalid procedure fails to respect the n parameter when the string ends in a multibyte codepoint. In those cases it will read past the end when ensuring the codepoint is terminated; the bounds check does not cover the later str[2] access:

      /* ensure that there's 2 bytes or more remained */
      if (remained < 2) {
        return (utf8_int8_t *)str;
      }

      /* ensure the 1 following byte in this 2-byte
       * utf8 codepoint began with 0b10xxxxxx */
      if (0x80 != (0xc0 & str[1])) {
        return (utf8_int8_t *)str;
      }

      /* ensure that our utf8 codepoint ended after 2 bytes */
      if (0x80 == (0xc0 & str[2])) {
        return (utf8_int8_t *)str;
      }

This fails in cases such as the following, where a string is unterminated:

#include <assert.h>
#include <string.h>
#include "utf8.h"

int main(int argc, char** argv) {
    const char terminated[] = "\xc2\xa3"; // UTF-8 encoding of U+00A3 (pound sign)
    size_t terminated_length = strlen(terminated);

    const char memory[] = "\xff\xff\xff\xff"
                          "\xc2\xa3"
                          "\x80\xff\xff\xff";

    const char* unterminated_begin = &memory[4];
    const char* unterminated_end = &memory[strlen(memory) - 4];
    size_t unterminated_length = unterminated_end - unterminated_begin;

    assert(terminated_length == unterminated_length);
    assert(strncmp(terminated, unterminated_begin, unterminated_length) == 0);
    // The two strings are identical within the bounds that are passed to
    // utf8nvalid, so we would expect these two tests to pass.
    assert(utf8nvalid(terminated, terminated_length) == NULL);
    assert(utf8nvalid(unterminated_begin, unterminated_length) == NULL); // fails!
}

Request for utf8makevalid() function in addition to utf8valid()

A common use case is that an application has to somehow work with the string provided even if it may have invalid sequences. This function would replace invalid utf8 sequences in a string with the specified valid utf8 character byte, ensuring that the output is valid utf8 and has the same total byte length as the input.
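A much-simplified sketch of what such a function could do (makevalid_sketch is hypothetical; it only checks lead/continuation patterns and ignores overlong encodings and surrogates):

```c
/* simplified sketch of the requested utf8makevalid: overwrite any byte
 * that does not fit a well-formed lead/continuation pattern with the
 * replacement byte, keeping the total byte length unchanged */
static void makevalid_sketch(char *s, char replacement) {
  unsigned char *p = (unsigned char *)s;
  while (*p) {
    int len, i;
    if (*p < 0x80)
      len = 1;
    else if ((*p & 0xe0) == 0xc0)
      len = 2;
    else if ((*p & 0xf0) == 0xe0)
      len = 3;
    else if ((*p & 0xf8) == 0xf0)
      len = 4;
    else { /* stray continuation byte or invalid lead */
      *p++ = (unsigned char)replacement;
      continue;
    }
    for (i = 1; i < len; i++) {
      if ((p[i] & 0xc0) != 0x80)
        break; /* missing or bad continuation (also catches '\0') */
    }
    if (i == len)
      p += len; /* well-formed sequence, keep it */
    else
      *p++ = (unsigned char)replacement; /* flag the bad lead byte */
  }
}
```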

utf8ncpy incorrectly loops when it does not hit null-terminator

Following up on my issue from #50, utf8ncpy doesn't (now) correctly stop at n bytes unless it hits the null-terminator in src. The following code demonstrates the problem:

#include "utf8.h"

int main(int argc, char* argv[]) {
  char buffer[10];
  utf8ncpy(buffer, "foo", 2);
}

Running this program results in a segmentation fault for me, due to n wrapping around past 0.

Changing the 2 to 3 works, because the null-terminator is hit at the end of the string "foo".

conflicting int32_t definition

utf8.h detects _MSC_VER and #defines int32_t. The problem is that I've got another header that uses int32_t in a typedef, which ends up creating this:

typedef __int32 __int32;

which gives me a compiler error. Since stdint.h is included in Visual Studio 2010 and up, I propose changing the check for _MSC_VER to something like this:

#if (_MSC_VER < 1600)
#define int32_t __int32
#define uint32_t unsigned __int32
#else
#include <stdint.h>
#endif

That way if we have stdint.h with visual studio we don't have to resort to #defines.

Bug in utf8len on malformed UTF-8 string

A malformed UTF-8 string will cause utf8len to read memory past str's null character. For example, a buffer like this:

char str[] = { -16, '\0' };

The function needs additional checks for a null character appearing somewhere unexpected, and should then report an error in str. Maybe it needs an error argument, should set errno, or should return something else.

Invalid pointer returned when calling utf8codepoint function for a empty string

Sample code to reproduce the issue

const char *emptystr = u8"";
utf8_int32_t c;
void *ret = utf8codepoint((void *)emptystr, &c);

It is expected to return (void *)emptystr, but it returns (void *)(emptystr + 1).
ret is now a bad pointer: it points to the address after the null terminator!

I suggest adding a null check at the beginning of the function, see below.


void *utf8codepoint(const void *utf8_restrict str,  utf8_int32_t *utf8_restrict out_codepoint) {
  const char *s = (const char *)str;

  // make sure an empty string will always return a fixed result: the pointer to str itself;
  // without this check it could return an invalid position (s + x), which can cause memory issues
  if ('\0' == *s) {
    return (void *)s;
  }

...

  return (void *)s;
}

More readable/maintainable tests

Hi

I was just looking through tests/main.c and noticed that all the error codes are hard-coded into the test functions themselves.

I'm not sure what the reason is for this, but my initial thought was that it would be better to have a header file define an enum that could contain descriptive error codes for all the functions.

Tedious and boring work, I know, but it would make it easier to add new error codes and to reason about each test, assuming this wouldn't break something that relies on knowing the codes without being able to read an enum...

How to get codepoint of first character in iteration?

This code is following the example of the PR #21 to iterate. However, utf8codepoint is for getting the pointer and the codepoint of the next character. How can I get the codepoint of the first character?

utf8_char = utf8codepoint(utf8_string, &codepoint);
while (codepoint != '\0') {
	this_char = malloc(utf8codepointsize(codepoint) + 1);
	memset(this_char, 0, utf8codepointsize(codepoint) + 1);
	memcpy(this_char, utf8_char, utf8codepointsize(codepoint));
	printf("This char: %s\n", this_char);
	utf8_char = utf8codepoint(utf8_char, &codepoint);
}

Bug with utf8casecmp?

It seems utf8casecmp is not working correctly. I was trying to use it with std::set as a custom comparator. I compared it to strcasecmp and found it does not give the same results for basic ASCII strings. Note I wouldn't expect the same values, but I would expect the signs to match.

    printf("%d\n", strcasecmp(".gdoc", ".GSHeeT")); // -15
    printf("%d\n", utf8casecmp(".gdoc", ".GSHeeT")); // 1
    printf("%d\n", strcasecmp(".gsheet", ".gSLiDe")); // -4
    printf("%d\n", utf8casecmp(".gsheet", ".gSLiDe")); // 1
