Comments (13)
Ok. So this makes sense. We will add this functionality.
from simdutf.
It is dead code. Let us remove it.
from simdutf.
btw is there a plan for adding support to validate_latin1?
from simdutf.
btw is there a plan for adding support to validate_latin1?
It is not logically possible. Any sequence of bytes is valid latin1.
from simdutf.
It is not logically possible. Any sequence of bytes is valid latin1.
Oh, what I mean is something logically equivalent to: does every Unicode code point in this UTF-8/16/32 etc. encoded sequence have a value of at most 0xFF? Is that the definition of Latin1, or am I just misunderstanding it?
from simdutf.
Can you describe your use case?
from simdutf.
Can you describe your use case?
e.g. given a UTF-8 sequence:
- 1-byte UTF-8 -> true
- 2-byte UTF-8 -> check if its Unicode value is <= 0xFF by parsing ((a & 0b00011111) << 6 | (b & 0b00111111))
- 3/4-byte UTF-8 -> false
I guess it is more like validate_latin1_from_utf8 for this case? I don't know much about the other encodings.
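The three cases above can be sketched as a scalar check. This is only an illustration of the rule described in the comment: `utf8_fits_latin1` is a hypothetical name, not a simdutf function, and the sketch assumes the input has already been validated as UTF-8.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of simdutf). Assumes `src` is valid UTF-8.
// Returns true when every code point is <= 0xFF, i.e. fits in Latin1.
bool utf8_fits_latin1(const char* src, std::size_t len) {
  assert(src != nullptr || len == 0);
  const unsigned char* p = reinterpret_cast<const unsigned char*>(src);
  for (std::size_t i = 0; i < len;) {
    unsigned char a = p[i];
    if (a < 0x80) {  // 1-byte sequence: always fits
      i += 1;
      continue;
    }
    if ((a & 0xE0) == 0xC0 && i + 1 < len) {  // 2-byte sequence
      unsigned char b = p[i + 1];
      unsigned cp = ((a & 0x1Fu) << 6) | (b & 0x3Fu);
      if (cp > 0xFF) return false;
      i += 2;
      continue;
    }
    return false;  // 3- or 4-byte sequence (or truncated input)
  }
  return true;
}
```

In valid UTF-8 only the lead bytes 0xC2 and 0xC3 produce a 2-byte code point at or below 0xFF, so the parse could be replaced by a comparison on the lead byte.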
from simdutf.
I understand the functionality, but what is your application?
We try to focus on providing functionality that is useful in practice. So I want to understand the motivation.
from simdutf.
For me it is mostly because I need to transfer any UTF-8 sequences stored in char* to a Python Unicode object in C, which requires providing the max unit value (0x7f for ASCII, 0xff for Latin1, 0xffff, or 0x10ffff). Plus it could be used for the check in latin1_length_from_utf8?
But I also admit that I don't know how much SIMD can improve this process, since it seems straightforward: make a 0b1000_0000 mm128/256 mask, apply cmplt_mask to every batch, and loop over every batch whose mask is not equal to 0.
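A rough sketch of that idea follows. To stay portable it tests the high bits of 8 bytes at a time through a 64-bit word (SWAR) instead of using mm128/256 intrinsics, so it is a stand-in for the SIMD version, not the SIMD version itself. `utf8_maxchar_bound` is a hypothetical name, and the input is assumed to be valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not part of simdutf). Assumes `src` is valid UTF-8.
// Returns the smallest of 0x7F, 0xFF, 0xFFFF, 0x10FFFF that bounds every
// code point in the input.
std::uint32_t utf8_maxchar_bound(const char* src, std::size_t len) {
  assert(src != nullptr || len == 0);
  const unsigned char* p = reinterpret_cast<const unsigned char*>(src);
  std::uint32_t bound = 0x7F;
  std::size_t i = 0;
  while (i < len) {
    if (i + 8 <= len) {
      std::uint64_t w;
      std::memcpy(&w, p + i, 8);  // unaligned-safe load of 8 bytes
      if ((w & 0x8080808080808080ULL) == 0) {
        i += 8;  // all ASCII: skip the whole batch
        continue;
      }
    }
    unsigned char a = p[i];  // slow path: classify one character
    if (a < 0x80) {
      i += 1;
    } else if (a < 0xE0) {
      // 2-byte sequence: lead bytes 0xC2 and 0xC3 cover U+0080..U+00FF
      std::uint32_t b = (a <= 0xC3) ? 0xFFu : 0xFFFFu;
      if (b > bound) bound = b;
      i += 2;
    } else if (a < 0xF0) {
      if (bound < 0xFFFF) bound = 0xFFFF;  // 3-byte sequence
      i += 3;
    } else {
      return 0x10FFFF;  // 4-byte sequence: nothing larger is possible
    }
  }
  return bound;
}
```

The result is only ever one of the four bucket values, which is all that a caller choosing a storage width needs, so the sketch never decodes a full code point.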
from simdutf.
For me it is mostly because I need to transfer any UTF-8 sequences stored in char* to a Python Unicode object in C, which requires providing the max unit value
Can you point at the API you are trying to support?
Is that part of the standard Python API?
from simdutf.
Can you point at the API you are trying to support?
Is that part of the standard Python API?
It is PyUnicode_New. I don't want to use PyUnicode_FromString since it requires an additional buffer, which may be costly when the length is large. So I will use PyUnicode_New to create the buffer and write into it directly, which requires me to determine the kind of the PyUnicode object, as I said.
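As a sketch of the "write into it directly" step for the Latin1 case: in CPython the destination would presumably be the 1-byte buffer of an object created with PyUnicode_New(n, 0xFF) and accessed through PyUnicode_1BYTE_DATA, where n is the number of code points. Here a plain array stands in for that buffer so the example does not need Python.h. `decode_utf8_to_latin1` is a hypothetical helper, and it assumes valid UTF-8 whose code points are all <= 0xFF.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a simdutf or CPython function).
// Assumes `src` is valid UTF-8 with every code point <= 0xFF and that
// `dst` has room for one byte per code point.
// Returns the number of characters written.
std::size_t decode_utf8_to_latin1(const char* src, std::size_t len,
                                  unsigned char* dst) {
  assert(src != nullptr || len == 0);
  const unsigned char* p = reinterpret_cast<const unsigned char*>(src);
  std::size_t out = 0;
  for (std::size_t i = 0; i < len;) {
    unsigned char a = p[i];
    if (a < 0x80) {
      dst[out++] = a;  // ASCII byte is copied unchanged
      i += 1;
    } else {
      if (i + 1 >= len) break;  // truncated sequence: stop early
      // 2-byte sequence: merge 5 payload bits of the lead byte with 6 of
      // the continuation byte
      dst[out++] = static_cast<unsigned char>(((a & 0x1F) << 6) |
                                              (p[i + 1] & 0x3F));
      i += 2;
    }
  }
  return out;
}
```

The wider kinds (maxchar 0xFFFF or 0x10FFFF) would follow the same pattern with 2-byte or 4-byte destination elements.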
from simdutf.
@Nambers Can you give me a full code routine? How would you use it afterward efficiently? Have you run benchmarks?
I understand here that the idea is that we could take UTF-8 strings in C or C++ and move them to Python at high speed. If simdutf can provide functionality to help with that, we will, for sure, add it.
But I'd like to make sure that it is worth our time to do it.
cc @TkTech
from simdutf.
I understand here that the idea is that we could take UTF-8 strings in C or C++ and move them to Python at high speed. If simdutf can provide functionality to help with that, we will, for sure, add it.
Sounds great. So basically I need a function using SIMD to get the max char value, or just to detect if it is Latin1 (<= 0xFF) or 3-byte UTF-8 (<= 0xFFFF).
Here is my implementation so far: str.c#L11 (I also need to consider Unicode escapes in it, but I guess that is not something simdutf needs to consider).
Then, after determining the kind, we just take some fast paths, as in the code that follows in my file.
ref:
The cjson, ujson, and Python standard json implementations don't use SIMD.
orjson also doesn't use SIMD; it uses a Rust function, create.rs#L15.
I can't say what method is used by pysimdjson.
EDIT:
I was trying to use count_utf8(buf) == latin1_length_from_utf8(buf) to check if it is Latin1, but I don't know if it is costly.
from simdutf.