sheredom / utf8.h

📚 single header utf8 string functions for C and C++
License: The Unlicense
As far as I can tell, the library currently only provides a way to iterate over a byte sequence in one direction, using utf8codepoint. It would be super useful to have a function to go the other way too, and possibly rename them to utf8next and utf8prev or something similar? I'd be down to add this if it's something that you'd merge!
Thanks for the library, it's very clean, lightweight and useful!
Hi, I was looking at the docs for utf8upr/utf8lwr, and they don't seem to indicate what happens if the string passed to them doesn't have enough space for the new codepoints. I understand that letters may have different byte sizes in their upper/lowercase variants, so I was wondering whether utf8upr/utf8lwr will allocate extra memory as required.
Looking at the code, though, it seems like they just call utf8catcodepoint, which AFAIK doesn't allocate additional memory. In fact, the size argument in that call is set to the size of the new codepoint, rather than the size of the buffer as it should be. Is this correct?
Hey, thanks for this library. I'd like to be able to use it in a constexpr context. Would you be willing to add a utf8_constexpr macro that would be used in contexts that allow it? I might be able to work on a PR, but I don't have a time estimate.
I need to add documentation for each of the functions into the readme, including example usages of each of the functions.
Hey, nice library! I am looking for utf-8 C string parsing and this fits the bill. I had a couple of thoughts after reading the code.
For instance, a safer utf8ncpy function that guarantees a null terminator (possibly truncating the last character) and returns a boolean indicating whether the string was truncated would be helpful, even if it doesn't conform to anything in string.h. It should also fill in only one NUL character at the end, because zeroing after the terminator is a waste of cycles. I have been using such a workhorse function for years.
Here is a snippet that lets you portably apply the restrict keyword if you're interested:
#if defined(__GNUC__) || defined(__clang__)
#ifdef __cplusplus
#define utf8_restrict __restrict
#else
#if __STDC_VERSION__ >= 199901L
#define utf8_restrict restrict
#else
#define utf8_restrict
#endif
#endif
#elif defined(_MSC_VER) && (_MSC_VER >= 1400) /* vs2005 */
#define utf8_restrict __restrict
#else
#define utf8_restrict
#endif
It'd be nice to have a function for this. Perhaps a function that returns how many bytes a codepoint needs would be enough.
How do I loop through a utf8 string? Something like:
for (int i = 0; i < strlen(str); i++) do_something_with(str[i]);
Line 1159 in 89a1a24
The correct lowercase mapping for U+03F4 is U+03B8:
https://www.compart.com/en/unicode/U+03F4
https://codepoints.net/U+03F4?lang=en
What do you suggest is the best way of iterating over codepoints using this library?
So I stumbled upon this: https://github.com/sheredom/utf8.h/blob/master/utf8.h#L697
Is this a crash waiting to happen, or am I reading the logic wrong?
If h and n are equal strings, this will read past the end of those strings.
This is my current implementation. I am using it to replace all of my strndups:
#include <stdlib.h>
#include <utf8.h>
void*
utf8ndup(const void* src, size_t n)
{
const char* s = (const char*)src;
// +1 so there is room for the null terminator even when we copy n bytes
char* c = (char*)malloc(n + 1);
if (0 == c) {
// out of memory so we bail
return 0;
}
size_t bytes = 0;
// copy src byte-by-byte into our new utf8 string
while ('\0' != s[bytes] && bytes < n) {
c[bytes] = s[bytes];
bytes++;
}
// append null terminating byte
c[bytes] = '\0';
return c;
}
I don't know if this is desirable. I am almost half tempted to just calloc and memcpy the result.
Here is an example test case where I tell utf8ncpy to write at most 10 bytes, but all 11 bytes of the buffer end up being written. I first noticed this in a larger program when it triggered a stack check exception due to buffer overflow.
#include <string.h>
#include <stdio.h>
#include "utf8.h"
int main(int argc, char* argv[]) {
char buffer[11];
memset(buffer, 0xdd, 11);
printf("%02x\n", buffer[10] & 0xff);
utf8ncpy(buffer, "foo", 10);
printf("%02x\n", buffer[10] & 0xff);
}
which I have compiled simply with clang main.c, clang version: Apple LLVM version 9.1.0 (clang-902.0.39.2).
I get the result of
dd
00
when I would expect
dd
dd
Add a test similar to the one used in #109.
utf8valid will fail on a utf8 file with 2 or more linebreaks in a row.
Hello 👋! I think I found a small bug in utf8ncat when the function is executed with size_t n being 0. The function will still write all remaining bytes to the dst buffer.
for example:
utf8_int8_t dst[12] = { 'h', 'e', 'l', 'l', 'o', '\0' };
const utf8_int8_t* src = "world";
utf8ncat(dst, src, 0);
// dst will be { 'h', 'e', 'l', 'l', 'o', 'w', 'o', 'r', 'l', 'd', '\0', '\0' };
If I am not mistaken, it is because size_t is unsigned, which causes the following --n to wrap around:
Line 631 in 89f6a43
Strictly speaking unsigned wraparound is well-defined in C, but the resulting huge count is clearly a bug.
Currently utf8.h lowercase conversion only handles ASCII chars. It is not possible to handle accented uppercase and lowercase vowels like Á, for example. For my specific purpose I would like to cover at least ÀÈÌÒÙ, since they cover most Latin languages. For the time being I would also be OK with monkey-patching utf8.h myself. I suppose the change has to take place here:
Line 1016 in 1ca34ec
Well, this is not an issue; it is an elementary question/request.
You have said: "..... Having it as a void* forces a user to explicitly cast the utf8 string to char* such that the onus is on them not to break the code anymore! ...."
I am not very strong on this.
Can you please give an example of how to do this casting in code, or point me to a page where such example code is given?
Thanks
Hi maintainers,
I did some minor analysis on your library using the KLEE symbolic execution engine, which at its core tries to explore all the different execution paths in the software under analysis to find bugs. It's an academic tool and you can see it here: https://klee.github.io/
During this process I found several overflows in the code and I wanted to report them collectively, so that's the purpose of this issue.
The small example program I used is the following:
#include "utf8.h"
int
main(int argc, char **argv)
{
char arr[10];
klee_make_symbolic(arr, 10, "arr");
klee_assume(arr[9] == '\0');
char arr1[10];
klee_make_symbolic(arr1, 10, "arr1");
klee_assume(arr1[9] =='\0');
void *arr_check = utf8valid(arr);
void *arr1_check = utf8valid(arr1);
if (arr_check != 0 && arr1_check != 0)
{
if (utf8ncasecmp(arr_check, arr1_check, 9) == 0)
return 1;
return 0;
}
return 1;
}
The calls to klee_make_symbolic trigger KLEE to consider the values in the arr and arr1 buffers to be unknowns in an equation system. It's not really necessary to understand the details of this to understand the bugs that I am reporting; the main point is that the bugs I found essentially set arr and arr1 to certain characters, and then the execution of utf8ncasecmp results in some memory out-of-bounds access. Some of the bugs shown here I have also analysed with AddressSanitizer to confirm them.
If my code snippet above uses your library in an erroneous manner then please disregard the bugs.
arr value: "\xf0\x00\x00\x01\xf0\x00\x01\x01\xf0\x00"
arr1 value: "\xf0\x00\x00\x01\xf0\x00\x01\x01\xff\x00"
Type: memory-out-of-bound
Stack trace:
#0 in utf8codepoint at ./utf8.h:987
#1 in utf8ncasecmp at ./utf8.h:507

arr value: "\xf0\x00\x00\x01\xf0\x00\x01\x01\xe0\x00"
arr1 value: "\xf0\x00\x00\x01\xf0\x00\x01\x01\xff\x00"
Type: memory-out-of-bound
Stack trace:
#0 in utf8codepoint at ./utf8.h:992
#1 in utf8ncasecmp at ./utf8.h:507

arr value: "\xf0\x00\x00\x01\xe0\x00\x01\xf0\xff\x00"
arr1 value: "\xf0\x00\x00\x01\xf0\x00\x00\x01\xff\x00"
Type: memory-out-of-bound
Stack trace:
#0 in utf8codepoint at ./utf8.h:987
#1 in utf8ncasecmp at ./utf8.h:507

arr value: "\xf0\x00\x00\x01\xf0\x00\x00\x01\xc1\x00"
arr1 value: "\xf0\x00\x00\x01\xf0\x00\x00\x01\xc1\x00"
Type: memory-out-of-bound
Stack trace:
#0 in utf8codepoint at ./utf8.h:984
#1 in utf8ncasecmp at ./utf8.h:507

arr value: "\xf1\x00\x00\x00Y\xf0\x00\x01\x00\x00"
arr1 value: "\xf1\x00\x00\x00\xc19\xf0\x00\x01\x00"
Type: memory-out-of-bound
Stack trace:
#0 in utf8ncasecmp at ./utf8.h:494
arr value: "\xe1\x80\x00\xe0\x01\x1b\xf0\x00\x02\x00"
arr1 value: "\xf0\x01\x00\x00\xf0\x00\x01\x1b\xc2\x00"
Type: memory-out-of-bound
Stack trace:
#0 in utf8ncasecmp at ./utf8.h:494

arr value: "\xf1\x00\x00\x00\xc3 \xc2\x00\x00\x00"
arr1 value: "\xf1\x00\x00\x00\xe0\x03\x00\xe0\x02\x00"
Type: memory-out-of-bound
Stack trace:
#0 in utf8ncasecmp at ./utf8.h:468

arr value: "\xf1\x00\x00\x00\xe0\x03\x17\xe0\x01\x00"
arr1 value: "\xf1\x80\x00\x00\xc3\x17\xc1\x00\xff\x00"
Type: memory-out-of-bound
Stack trace:
#0 in utf8ncasecmp at ./utf8.h:481

arr value: "\xf1\x00\x00\x00\xc3 \xc2\x00\xc0\x00"
arr1 value: "\xf1\x00\x00\x00\xe0\x03\x00\xe0\x02\x00"
Type: memory-out-of-bound
Stack trace:
#0 in utf8ncasecmp at ./utf8.h:470

arr value: "\xf1\x00\x00\x00\xc3\x00\xe0\x01\x00\x00"
arr1 value: "\xf1\x00\x00\x00\xe0\x03 \xe0\x01\x00"
Type: memory-out-of-bound
Stack trace:
#0 in utf8ncasecmp at ./utf8.h:481

arr value: "\xf1\x00\x00\x00\xc1\x10\xc1\x00\xf0\x00"
arr1 value: "\xf1\x80\x00\x00\xe0\x010\xe0\x01\x00"
Type: memory-out-of-bound
Stack trace:
#0 in utf8ncasecmp at ./utf8.h:496
At the moment we can pass alloc_func_ptr into functions that need to allocate memory, but the malloc path is still live, and on bare-metal platforms that don't have an actual malloc in the C library it won't compile. It would be nice if a define could remove the malloc call completely... I'm happy for it to fail if there is no malloc; I'll always be passing alloc_func_ptr.
For now I've just defined it out like so:
#if UTF8_NO_STD_MALLOC
/* no malloc available; you must pass in alloc_func_ptr */
assert(false);
#else
n = (utf8_int8_t *)malloc(bytes);
#endif
In the code:
utf8_int8_t *utf8ncpy(utf8_int8_t *utf8_restrict dst,
const utf8_int8_t *utf8_restrict src, size_t n) {
...
for (check_index = index - 1;
check_index > 0 && 0x80 == (0xc0 & d[check_index]); check_index--) {
/* just moving the index */
}
For code points above 0x7F, 0xc0 is the valid mask for the first byte; for the rest it is 0x80 (https://en.wikipedia.org/wiki/UTF-8).
Consider the string °¯\_(ツ)_/¯° and let's add a printout:
/* just moving the index */
printf("index:%lu byte:%x cond:%u\n", check_index, (unsigned)(unsigned char)d[check_index],
0x80 == (0xc0 & d[check_index]));
copy index:0 byte:c2
copy index:1 byte:b0
copy index:2 byte:c2
copy index:3 byte:af
copy index:4 byte:5c
copy index:5 byte:5f
copy index:6 byte:28
copy index:7 byte:e3
copy index:8 byte:83
copy index:9 byte:84
copy index:10 byte:29
copy index:11 byte:5f
copy index:12 byte:2f
copy index:13 byte:c2
copy index:14 byte:af
copy index:15 byte:c2
copy index:16 byte:b0
copy index:17 byte:0
found null
index:16 byte:b0 cond:1
index:15 byte:c2 cond:0
The code following after that will chop the last valid code point:
if (check_index < index &&
(index - check_index) < utf8codepointsize(d[check_index])) { //(17-15)=2 < utf8codepointsize=4
index = check_index;
}
The fix that worked for me: (index - check_index) < utf8codepointcalcsize(&d[check_index])
The problem with using utf8codepointsize:
utf8_constexpr14_impl size_t utf8codepointsize(utf8_int32_t chr) {
if (0 == ((utf8_int32_t)0xffffff80 & chr)) {
return 1;
} else if (0 == ((utf8_int32_t)0xfffff800 & chr)) {
return 2;
} else if (0 == ((utf8_int32_t)0xffff0000 & chr)) {
return 3;
} else { /* if (0 == ((int)0xffe00000 & chr)) { */
return 4;
}
}
is that, because plain char is signed on most platforms, c2 is sign-extended to ffffffc2, and then none of the 0xffffxxxx & chr masks evaluate to 0.
utf8.h is currently published under the Unlicense, putting its work in the public domain. This is great, but there are open questions as to whether this is valid in all jurisdictions (Germany being the most famous example).
As such, would you be at all willing to consider dual-licensing this software under the Unlicense and another "fallback" license? The CC0 license is another public domain license with a clause for what should happen when the terms of the license are deemed invalid under local law. Alternatively, there exist other minimal OSI approved licenses (such as the MIT license, the ISC license and the BSD licenses) which are permissive. These typically require attribution from the user, but if the software were dual-licensed, it would be entirely their choice which license they want to use.
Absolutely no worries if this is too big an ask, just really want to be able to use this software in a more legally-watertight way.
Would be nice to maybe provide a way for the person including the utf8 code to be able to do something like
#define utf8malloc my_alloc
#include "utf8.h"
I would make a PR to do this, but I don't know if it would be anything anyone is interested in.
Are there any functions like utf8codepoint and utf8rcodepoint but for graphemes?
Is there a way to call utf8valid with a string which is not null-terminated?
Hello,
It seems to me that utf8makevalid can read the string to modify out of bounds:
while ('\0' != *read) {
if (0xf0 == (0xf8 & *read)) {
/* ensure each of the 3 following bytes in this 4-byte
* utf8 codepoint began with 0b10xxxxxx */
if ((0x80 != (0xc0 & read[1])) || (0x80 != (0xc0 & read[2])) ||
(0x80 != (0xc0 & read[3]))) {
It seems to me that we cannot be sure that read[1], [2] and [3] are not out of bounds.
Regards,
PS: the same problem exists in utf8codepoint and maybe other functions, but it is particularly important for utf8makevalid, because I can pass any invalid string as input.
While working on #92, I noticed there was no clang format file provided with utf8.h
I tried to respect the formatting, but it's hard to be always consistent
Providing a clang format file would allow contributors to follow the repo style guide easily
I would like to point out that an identifier like "__UTF8_H__" does not conform to the naming rules of the C++ language standard: identifiers containing double underscores are reserved for the implementation. Would you like to adjust your choice of unique names?
Hello, I think I have found a bug in the utf8rchr code. On some occasions it skips past the null terminator of the string and continues reading until it finds the specified character. Running the following code under gcc produced the problem:
#include <stdio.h>
#include "utf8.h"
int main()
{
char *s1 = "Hello";
char *s2 = "Hello ";
char *result = utf8rchr(s1, 'o');
printf("String pointer: %llx\n", s1);
printf("Char pointer : %llx\n", result);
printf("Index : %d\n", result-s1);
return 0;
}
This code produced the following output, indicating that the last occurrence of 'o' was at index 10 (it should be at 4), well past the end of the string.
Output from gcc:
String pointer: 7ff659d368d4
Char pointer : 7ff659d368de
Index : 10
Looking at the code for utf8rchr, I believe the problem is where offset is incremented by 2 (for a single-byte ASCII character) instead of 1, skipping the null terminator:
while (src[offset] == c[offset]) {
offset++;
}
This doesn't occur on all occasions. I suspect that if the code encounters multiple Null characters before it finds another occurrence of the search character, it works okay.
How does this work with utf8? I can use tolower() and toupper() to get the lowercase or uppercase variants of ASCII chars, but there's nothing in this library for converting codepoints, from what I see.
Hello,
In utf8makevalid, you use the following test to identify a 4-byte sequence:
"if (0xf0 == (0xf8 & *read))"
This is not correct if you assume that you can receive any invalid string as an input parameter, since only a few values in the f0-ff range are valid.
Moreover, for the valid lead bytes in that range, the possible values for the second byte are not the same. For example, with f0, the valid range for the second byte is 90..bf instead of 80..bf.
Regards
// we are including the null terminating byte in the size calculation
size++;
I've been playing with adding utf8tok, but the problem with the original implementation is that it is not re-entrant.
I've been looking at how musl implemented strtok_r, and it's relatively simple (see here):
void *utf8tok_r(void *utf8_restrict str, const void *utf8_restrict sep, void **utf8_restrict ptr) {
char* s = (char*) str;
char** p = (char**) ptr;
if (!s && !(s = *p)) {
return NULL;
}
s += utf8spn(s, sep);
if (!*s) {
return *p = 0;
}
*p = s + utf8cspn(s, sep);
if (**p) {
*(*p)++ = 0;
} else {
*p = 0;
}
return s;
}
The following is the implemented test (it fails at the assert for föőf):
UTEST(utf8tok_r, token_walking) {
char* string = utf8dup("this|aäáé|föőf|that|");
char* ptr = NULL;
ASSERT_EQ(0, utf8ncmp(utf8tok_r(string, "|", &ptr), "this", 4));
ASSERT_EQ(0, utf8ncmp(utf8tok_r(string, "|", &ptr), "aäáé", 4));
ASSERT_EQ(0, utf8ncmp(utf8tok_r(string, "|", &ptr), "föőf", 4));
ASSERT_EQ(0, utf8ncmp(utf8tok_r(string, "|", &ptr), "that", 4));
free(string);
}
After playing with this for a bit, I am kind of at a loss for what to do.
Anyways, leaving this here in case someone else wants to pick it up and go on.
Unicode characters can have different visual widths; it would help if utf8.h had a built-in function to retrieve that.
A simple implementation can be found here: https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
simple usage of the above code with utf8.h
int codepoint;
void *v = utf8codepoint(text_ptr, &codepoint);
int w = mk_wcwidth((wchar_t)codepoint);
utf8chr uses an int type, which can be as narrow as 16 bits, for the chr code point argument. The maximum code point is 0x10FFFF, which needs more than 16 bits. To be more portable, the chr argument's type should be something that is always large enough, maybe long if you don't want to include stdint.h.
The utf8len function returns codepoints instead of bytes, as expected, but it seems things like utf8ncmp continue to use bytes, which wasn't what I expected. Perhaps utf8ncmp could use n in codepoints too, and another utf8bcmp could use b bytes?
Not at all critical, since a work-around is easy, and I have no idea if others would want codepoint counting instead of bytes for the n functions. I needed an n codepoint compare, so I noticed this.
// copy src byte-by-byte into our new utf8 string
while ('\0' != s[bytes]) {
n[bytes] = s[bytes];
bytes++;
}
It looks like utf8cpy will copy the entire string but assumes the destination is big enough, whereas utf8ncpy allows you to specify a destination buffer size limit but risks creating an invalid result if the source string is longer.
I'm curious when this second result is ever desirable? If I'm working with utf8 strings and I want to limit a string to a certain buffer size, shouldn't it crop the string at a valid code point?
In line 399 of utf8.h, the function size_t utf8size(const void *str) is redeclared, causing a compile-time error (using GCC 9.1). The first declaration is in line 162.
The utf8nvalid procedure fails to respect the n parameter when the string ends in a multibyte codepoint. In those cases, it will read past it when ensuring the codepoint is terminated; the bounds check does not cover the later str[2]:
/* ensure that there's 2 bytes or more remained */
if (remained < 2) {
return (utf8_int8_t *)str;
}
/* ensure the 1 following byte in this 2-byte
* utf8 codepoint began with 0b10xxxxxx */
if (0x80 != (0xc0 & str[1])) {
return (utf8_int8_t *)str;
}
/* ensure that our utf8 codepoint ended after 2 bytes */
if (0x80 == (0xc0 & str[2])) {
return (utf8_int8_t *)str;
}
This fails in cases such as the following, where a string is unterminated:
#include <assert.h>
#include <string.h>
#include "utf8.h"
int main(int argc, char** argv) {
const char terminated[] = "\xc2\xa3"; // UTF-8 encoding of U+00A3 (pound sign)
size_t terminated_length = strlen(terminated);
const char memory[] = "\xff\xff\xff\xff"
"\xc2\xa3"
"\x80\xff\xff\xff";
const char* unterminated_begin = &memory[4];
const char* unterminated_end = &memory[strlen(memory) - 4];
size_t unterminated_length = unterminated_end - unterminated_begin;
assert(terminated_length == unterminated_length);
assert(strncmp(terminated, unterminated_begin, unterminated_length) == 0);
// The two strings are identical within the bounds that are passed to
// utf8nvalid, so we would expect these two tests to pass.
assert(utf8nvalid(terminated, terminated_length) == NULL);
assert(utf8nvalid(unterminated_begin, unterminated_length) == NULL); // fails!
}
A common use case is that an application has to somehow work with the string provided even if it may have invalid sequences. This function would replace invalid utf8 sequences in a string with the specified valid utf8 character byte, ensuring that the output is valid utf8 and has the same total byte length as the input.
Following up on my issue from #50: utf8ncpy doesn't (now) correctly stop at n bytes unless it hits the null terminator in src. The following code demonstrates the problem:
#include "utf8.h"
int main(int argc, char* argv[]) {
char buffer[10];
utf8ncpy(buffer, "foo", 2);
}
Running this program results in a segmentation fault for me, due to n wrapping around past 0.
Changing the 2 to 3 works, because the null terminator is hit at the end of the string "foo".
The C++ committee decided to introduce in C++20 a char8_t type that is assumed to hold a UTF-8 encoded character.
P0482R5: char8_t: A type for UTF-8 characters and strings (Revision 5)
utf8.h detects _MSC_VER and #defines int32_t. The problem is that I've got another header that uses int32_t in a typedef, which ends up creating this:
typedef __int32 __int32;
which gives me a compiler error. Since stdint.h is included in Visual Studio 2010 and up, I propose changing the check for _MSC_VER to something like this:
#if (_MSC_VER < 1600)
#define int32_t __int32
#define uint32_t unsigned __int32
#else
#include <stdint.h>
#endif
That way if we have stdint.h with visual studio we don't have to resort to #defines.
A malformed UTF-8 string will cause utf8len to read memory past str's null character. A buffer like this, for example:
char str[] = { -16, '\0' };
The function needs additional checks for a null character somewhere it is not expected, and should then report an error in str. Maybe it needs an error argument, should set errno, or should return something else.
Sample code to reproduce the issue:
utf8_int32_t c = 0;
const char * emptystr = u8"";
void * ret = utf8codepoint( (void*)emptystr, &c);
It is expected to return (void *)emptystr, but returns (void *)(emptystr + 1).
ret is now a bad pointer: it points to the address after the null terminator!
Suggest to add a null check at the beginning of the function, see below.
void *utf8codepoint(const void *utf8_restrict str, utf8_int32_t *utf8_restrict out_codepoint) {
const char *s = (const char *)str;
// make sure a null string will always return a fixed result, the pointer to str itself
// without the check it could return an invalid position (s+x) which can cause memory issues
if ('\0' == *s) {
return (void *)s;
}
...
return (void *)s;
}
Hi
I was just looking through tests/main.c and I noticed that all the error codes are hard coded into the test functions themselves.
I'm not sure what the reason is for this, but my initial thought was that it would be better to have a header file define an enum that could contain descriptive error codes for all the functions.
Tedious and boring work, I know, but would make it easier to add new error codes and to reason about each test - assuming this wouldn't break something that relies on knowing the codes without being able to read an enum...
This code follows the example in PR #21 to iterate. However, utf8codepoint gets the pointer and the codepoint of the next character. How can I get the codepoint of the first character?
utf8_char = utf8codepoint(utf8_string, &codepoint);
while (codepoint != '\0') {
this_char = malloc(utf8codepointsize(codepoint) + 1);
memset(this_char, 0, utf8codepointsize(codepoint) + 1);
memcpy(this_char, utf8_char, utf8codepointsize(codepoint));
printf("This char: %s\n", this_char);
utf8_char = utf8codepoint(utf8_char, &codepoint);
}
It seems utf8casecmp is not working correctly. I was trying to use it with std::set as a custom comparator. I compared it to strcasecmp and found it does not give the same results for basic ASCII strings. Note I wouldn't expect the same values, but I would expect the signs to match.
printf("%d\n", strcasecmp(".gdoc", ".GSHeeT")); // -15
printf("%d\n", utf8casecmp(".gdoc", ".GSHeeT")); // 1
printf("%d\n", strcasecmp(".gsheet", ".gSLiDe")); // -4
printf("%d\n", utf8casecmp(".gsheet", ".gSLiDe")); // 1