unicorn is a lightweight implementation of most of the standard C wide character functions, for platforms that don't support them but still have a wide character type in the form of `wchar_t`, as long as it is at least 16 bits wide (if unsigned) or 17 bits wide (if signed).
Note
this is just a hobby project. as much as I try to fix issues, you should still probably not expect it to always work properly. also, the code isn't exactly the most optimized. you have my warning.
unlike the standard functions which are locale-dependent, unicorn does not support locales, and always uses the same text encodings:
- wide characters (`wchar_t`) are assumed to be encoded in UTF-32 if `WCHAR_MAX` is at least `0x10FFFF` (e.g. Linux), or UTF-16 otherwise (e.g. Windows).
  - surrogates (`U+D800`-`U+DFFF`) are considered invalid in UTF-32.
  - a new function (`mbstowc`) has been implemented as an alternative to `mbtowc` to allow converting individual non-BMP characters in UTF-16.
- multibyte strings (used in `mbstowcs` and the like) are assumed to be encoded in UTF-8.
  - surrogates (`U+D800`-`U+DFFF`) are considered invalid in multibyte strings.
  - characters of length 5-8 are considered invalid, and so are 4-byte characters that exceed `U+10FFFF`.
  - overlong characters (characters encoded in a larger number of bytes than necessary) are considered invalid.
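
to make the UTF-8 rules above concrete, here is a minimal sketch of a strict single-character decoder. it is not unicorn's code: `utf8_decode_strict` is a hypothetical helper written only for illustration, and unicorn's own functions may be structured quite differently.

```c
#include <stddef.h>

/* hypothetical helper, NOT part of unicorn.
   decodes one UTF-8 character from s (at most n bytes) into *out.
   returns the number of bytes consumed, or 0 if the input is empty,
   truncated, or invalid under the rules listed above. */
static size_t utf8_decode_strict(const unsigned char *s, size_t n,
                                 unsigned long *out)
{
    unsigned long cp, min_cp;
    size_t len, i;

    if (n == 0) return 0;

    if (s[0] < 0x80) { *out = s[0]; return 1; }          /* ASCII */
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; min_cp = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; min_cp = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; min_cp = 0x10000UL; }
    else return 0; /* stray continuation byte, or lead byte of a 5-8 byte form */

    if (n < len) return 0;                               /* truncated */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;             /* not a continuation byte */
        cp = (cp << 6) | (unsigned long)(s[i] & 0x3F);
    }

    if (cp < min_cp) return 0;                           /* overlong encoding */
    if (cp >= 0xD800UL && cp <= 0xDFFFUL) return 0;      /* surrogate */
    if (cp > 0x10FFFFUL) return 0;                       /* beyond U+10FFFF */

    *out = cp;
    return len;
}
```

for example, the three bytes `E2 82 AC` decode to `U+20AC`, while `C0 80` (an overlong encoding of `U+0000`) and `ED A0 80` (the surrogate `U+D800`) are both rejected.
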
everything that unicorn implements uses the same name as its counterpart in standard C, except with a `UC_` prefix.
the only exception is the `wchar_t` type: unicorn uses the standard `wchar_t`.
unicorn is almost C89-compatible, except that it needs to know the maximum possible value of the `wchar_t` type.
if your compiling environment does not support C99 or newer, then unless your compiler itself predefines `WCHAR_MAX`, `__WCHAR_MAX`, or `__WCHAR_MAX__`, you need to manually define one of them at compile time (make sure to give it the correct value!).
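
as a rough sketch of how such a lookup can work (this is not unicorn's actual source), the preprocessor chain below tries the three macro names in turn and then picks the wide encoding from the result. the `DEMO_` names and `demo_wide_encoding` are made up for this example, and pulling `WCHAR_MAX` from `<stdint.h>` under C99 is an assumption about where the value comes from.

```c
/* assumption: on C99 or newer, <stdint.h> provides WCHAR_MAX */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
#  include <stdint.h>
#endif

/* illustrative names only: nothing prefixed DEMO_ exists in unicorn */
#if defined(WCHAR_MAX)
#  define DEMO_WCHAR_MAX WCHAR_MAX
#elif defined(__WCHAR_MAX)
#  define DEMO_WCHAR_MAX __WCHAR_MAX
#elif defined(__WCHAR_MAX__)
#  define DEMO_WCHAR_MAX __WCHAR_MAX__
#else
#  error "define WCHAR_MAX, __WCHAR_MAX, or __WCHAR_MAX__ at compile time"
#endif

/* UTF-32 needs room for U+10FFFF; anything smaller is treated as UTF-16 */
#if DEMO_WCHAR_MAX >= 0x10FFFF
#  define DEMO_WIDE_IS_UTF32 1
#else
#  define DEMO_WIDE_IS_UTF32 0
#endif

/* returns the name of the encoding selected above */
static const char *demo_wide_encoding(void)
{
    return DEMO_WIDE_IS_UTF32 ? "UTF-32" : "UTF-16";
}
```

with most Unix-style compiler drivers, a missing macro can be supplied with a `-D` option, e.g. `-DWCHAR_MAX=0xFFFF` on a platform whose `wchar_t` is a 16-bit unsigned type.
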
- the following will be implemented in a later update:
  - `wcstok` function.
- the following do not need to be implemented, because UTF-8 is stateless:
  - `mbstate_t` type.
  - `mbsinit` function.
  - thread-safe versions of encoding conversion functions.
- the following are not planned to be implemented any time soon (or maybe ever):
  - `wctype_t` type.
  - character type functions (`towlower`, `towupper`, `wcscasecmp`, `wcscasecmp_l`, `wcsncasecmp`, `wcsncasecmp_l`, `wctype`, and the `isw` family, including `iswctype`).
  - string to number conversion functions (`wcstol`, `wcstoul`, `wcstoll`, `wcstoull`, `wcstof`, `wcstod`, and `wcstold`).
  - functions that interact with file streams (e.g. `fgetws`, `fputws`, `wprintf`).
  - `wcscoll` and `wcscoll_l` functions.
  - `wcsftime` function.
  - `wcsdup` function.
  - `wcwidth` and `wcswidth` functions.
  - `wcsxfrm` and `wcsxfrm_l` functions.
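
on the point above that UTF-8 is stateless: every byte says on its own whether it starts a character or continues one, so nothing has to be remembered between characters. the sketch below illustrates this with a hypothetical helper (`utf8_char_start`, not part of unicorn) that finds the start of the character containing any byte offset. it assumes the input is already valid UTF-8 and that `pos` is inside the buffer.

```c
#include <stddef.h>

/* hypothetical helper, NOT part of unicorn.
   steps back from byte offset pos to the first byte of the character
   that contains it. continuation bytes always look like 10xxxxxx,
   so no saved conversion state is needed. */
static size_t utf8_char_start(const unsigned char *s, size_t pos)
{
    while (pos > 0 && (s[pos] & 0xC0) == 0x80)
        pos--;
    return pos;
}
```

in the byte string `61 E2 82 AC 7A` ("a", "€", "z"), offsets 1, 2, and 3 all map back to offset 1, where the three-byte character begins.
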
Important
you need to add a `UC_` prefix to the names of these functions, types, and macros!
- every `wchar.h` function not mentioned above, including a few nonstandard POSIX-only functions, like `wcpcpy`.
- `wint_t` type (equivalent to `signed long int`), with range macros `WINT_MIN` and `WINT_MAX`.
- `WEOF` macro (evaluates to `-1`).
- `MB_LEN_MAX` and `MB_CUR_MAX` macros (both evaluate to `4`, because the multibyte encoding is always UTF-8).
- wide character related `stdlib.h` functions (e.g. `wcstombs`, `mbstowcs`, `mblen`).
- nonstandard `mbstowc` function, which is an alternative to `mbtowc`, but expects a `wchar_t*` instead of `wchar_t`, to be able to read surrogate pairs in UTF-16.
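
for background on why surrogate pairs need special handling: when `wchar_t` is 16 bits wide, a non-BMP character does not fit in one unit and has to be split into two. the sketch below shows the arithmetic only. `utf16_encode` and `utf16_combine` are hypothetical helpers, not unicorn functions, and they say nothing about the actual signature of `mbstowc`.

```c
#include <stddef.h>

/* hypothetical helper, NOT part of unicorn.
   encodes code point cp as UTF-16 into out.
   returns the number of 16-bit units written (1 or 2), or 0 if cp is
   a surrogate or lies beyond U+10FFFF. */
static size_t utf16_encode(unsigned long cp, unsigned short out[2])
{
    if (cp >= 0xD800UL && cp <= 0xDFFFUL) return 0;
    if (cp > 0x10FFFFUL) return 0;

    if (cp < 0x10000UL) {            /* BMP: a single unit is enough */
        out[0] = (unsigned short)cp;
        return 1;
    }

    cp -= 0x10000UL;                 /* 20 bits remain */
    out[0] = (unsigned short)(0xD800UL + (cp >> 10));    /* high surrogate */
    out[1] = (unsigned short)(0xDC00UL + (cp & 0x3FFUL)); /* low surrogate */
    return 2;
}

/* hypothetical helper: rebuilds a code point from a valid surrogate pair */
static unsigned long utf16_combine(unsigned short hi, unsigned short lo)
{
    return 0x10000UL
         + (((unsigned long)(hi - 0xD800) << 10)
            | (unsigned long)(lo - 0xDC00));
}
```

for example, `U+1F984` becomes the pair `D83E DD84`, and combining that pair gives `U+1F984` back.
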