<a href="https://github.com/simdutf/simdutf/blob/master/tests/reference/validate_latin

tests/reference/validate_latin1 implemented incorrectly (simdutf issue, closed)

Nambers commented on June 15, 2024

tests/reference/validate_latin1 implemented incorrectly

from simdutf.

Comments (13)

lemire commented on June 15, 2024

Ok. So this makes sense. We will add this functionality.

lemire commented on June 15, 2024

It is dead code. Let us remove it.

Nambers commented on June 15, 2024

btw is there a plan for adding support to validate_latin1?

lemire commented on June 15, 2024

> btw is there a plan for adding support to validate_latin1?

It is not logically possible. Any sequence of bytes is valid latin1.

Nambers commented on June 15, 2024

> It is not logically possible. Any sequence of bytes is valid latin1.

Oh, what I mean is a check that is logically equivalent to: every Unicode code point in this UTF-8/16/32 (etc.) encoded sequence has a value of at most 0xFF. Is that the definition of Latin1, or am I misunderstanding it?

lemire commented on June 15, 2024

Can you describe your use case?

Nambers commented on June 15, 2024

> Can you describe your use case?

E.g., given a UTF-8 sequence:

1-byte UTF-8 -> true
2-byte UTF-8 -> check whether its Unicode value is <= 0xFF by parsing ((a & 0b00011111) << 6 | (b & 0b00111111))
3/4-byte UTF-8 -> false

I guess it is more like validate_latin1_from_utf8 for this case? I don't know much about the other encodings.
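
The three rules above can be sketched as a plain scalar loop. This is only an illustration: the name utf8_fits_latin1 is hypothetical (it is not a simdutf function), and the sketch assumes the input is already valid UTF-8, so a stray continuation byte is simply reported as "not Latin1".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of simdutf.
 * Returns true when every code point in the buffer is <= 0xFF.
 * Assumption: the input is valid UTF-8. */
static bool utf8_fits_latin1(const unsigned char *s, size_t len) {
    assert(s != NULL || len == 0);
    size_t i = 0;
    while (i < len) {
        unsigned char a = s[i];
        if (a < 0x80) {
            /* 1-byte sequence (ASCII): always fits. */
            i += 1;
        } else if ((a & 0xE0) == 0xC0) {
            /* 2-byte sequence: decode and compare against 0xFF. */
            if (i + 1 >= len) return false; /* truncated input */
            unsigned char b = s[i + 1];
            unsigned cp = ((unsigned)(a & 0x1F) << 6) | (unsigned)(b & 0x3F);
            if (cp > 0xFF) return false;
            i += 2;
        } else {
            /* 3- or 4-byte sequence (or an unexpected byte): never fits. */
            return false;
        }
    }
    return true;
}
```

For example, "\xC3\xA9" (U+00E9) passes, while "\xC4\x80" (U+0100) and "\xE2\x82\xAC" (U+20AC) do not.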

lemire commented on June 15, 2024

I understand the functionality, but what is your application?

We try to focus on providing functionality that is useful in practice. So I want to understand the motivation.

Nambers commented on June 15, 2024

For me it is mostly because I need to transfer arbitrary UTF-8 sequences stored in a char* to a Python Unicode object in C, which requires providing the max code point value (it can be 0x7f for ASCII, 0xff for Latin1, 0xffff, or 0x10ffff). Plus, could it be used for the check in latin1_length_from_utf8?

But I also admit that I don't know how much SIMD can improve this process, since it seems straightforward: make a 0b1000_0000 mm128/256 mask, run cmplt_mask on every batch, and fall back to a for loop over every batch whose mask is not equal to 0.
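
A portable sketch of that batching idea follows. To keep it runnable without intrinsics, a 64-bit word stands in for the 128/256-bit register, so this is a swapped-in SWAR variant rather than the SIMD code described. The function name is hypothetical, and the input is assumed to be valid UTF-8. Under that assumption the fallback loop does not need to decode anything: a lead byte of 0xC4 or higher always starts a code point above 0xFF, while 0xC2 and 0xC3 cover U+0080 to U+00FF.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of simdutf.
 * Assumption: the input is valid UTF-8. */
static bool utf8_fits_latin1_batched(const unsigned char *s, size_t len) {
    assert(s != NULL || len == 0);
    /* One 0b1000_0000 bit per byte lane. */
    const uint64_t high_bits = UINT64_C(0x8080808080808080);
    size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        uint64_t word;
        memcpy(&word, s + i, sizeof word); /* unaligned-safe load */
        if ((word & high_bits) == 0) {
            continue; /* fast path: the whole batch is ASCII */
        }
        /* Slow path: inspect each byte of the non-ASCII batch. */
        for (size_t j = i; j < i + 8; j++) {
            if (s[j] >= 0xC4) return false;
        }
    }
    /* Tail shorter than one batch. */
    for (; i < len; i++) {
        if (s[i] >= 0xC4) return false;
    }
    return true;
}
```

Because the check looks only at individual byte values, a multi-byte sequence that straddles two batches needs no special handling.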

lemire commented on June 15, 2024

> For me it is mostly because I need to transfer arbitrary UTF-8 sequences stored in a char* to a Python Unicode object in C, which requires providing the max code point value

Can you point at the API you are trying to support?

Is that part of the standard Python API?

Nambers commented on June 15, 2024

> Can you point at the API you are trying to support?
>
> Is that part of the standard Python API?

It is PyUnicode_New. I don't want to use PyUnicode_FromString since it requires an additional buffer, which may be costly when the length is large. So I will use PyUnicode_New to create the buffer and write into it directly, which requires me to determine the kind of the PyUnicode object, as I said.
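
To make the "write into it directly" step concrete, here is a minimal sketch of the Latin1 case in plain C, without the Python headers. The function name is hypothetical. In the CPython setting, dst would presumably be the pointer returned by PyUnicode_1BYTE_DATA on an object created with PyUnicode_New(n, 0xFF), where n is the number of code points; that wiring is an assumption and is not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of simdutf or CPython.
 * Decodes UTF-8 into one byte per code point (Latin1).
 * dst must have room for at least `len` bytes, which is always enough.
 * Returns the number of bytes written, or (size_t)-1 if some code point
 * is above 0xFF or the input is truncated. */
static size_t utf8_to_latin1(const unsigned char *src, size_t len,
                             unsigned char *dst) {
    assert(src != NULL && dst != NULL);
    size_t out = 0;
    size_t i = 0;
    while (i < len) {
        unsigned char a = src[i];
        if (a < 0x80) {
            dst[out++] = a; /* ASCII is copied unchanged */
            i += 1;
        } else if ((a == 0xC2 || a == 0xC3) && i + 1 < len) {
            /* Lead bytes 0xC2 and 0xC3 encode U+0080 to U+00FF. */
            dst[out++] = (unsigned char)(((a & 0x1F) << 6) | (src[i + 1] & 0x3F));
            i += 2;
        } else {
            return (size_t)-1; /* does not fit in Latin1 */
        }
    }
    return out;
}
```

The returned count equals the number of code points, which is the length a caller would pass when allocating the destination.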

lemire commented on June 15, 2024

@Nambers Can you give me a full code routine? How would you use it afterward efficiently? Have you run benchmarks?

I understand here that the idea is that we could take UTF-8 strings in C or C++ and move them to Python at high speed. If simdutf can provide functionality to help with that, we will, for sure, add it.

But I'd like to make sure that it is worth our time to do it.

cc @TkTech

Nambers commented on June 15, 2024

> I understand here that the idea is that we could take UTF-8 strings in C or C++ and move them to Python at high speed. If simdutf can provide functionality to help with that, we will, for sure, add it.

Sounds great. So basically I need a function that uses SIMD to get the max char value, or just to detect whether the input is Latin1 (<= 0xFF) or at most 3-byte UTF-8 (<= 0xFFFF).
Here is my implementation so far: str.c#L11 (I also need to handle Unicode escapes in it, but I guess that is not something simdutf needs to consider).
Then, after determining the kind, we just take some fast paths, like the code that follows in my file.
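
One way to get an upper bound on the max char value without decoding is to track the largest byte, since in valid UTF-8 every lead byte is larger than every continuation byte. The sketch below is scalar and the function name is hypothetical; it is not the linked str.c implementation. It assumes valid UTF-8 input and returns one of the four bounds mentioned earlier (0x7F, 0xFF, 0xFFFF, 0x10FFFF), not the exact maximum code point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of simdutf.
 * Assumption: the input is valid UTF-8.
 * Returns an upper bound on the largest code point in the buffer. */
static uint32_t utf8_maxchar_bound(const unsigned char *s, size_t len) {
    assert(s != NULL || len == 0);
    unsigned char max_byte = 0;
    for (size_t i = 0; i < len; i++) {
        if (s[i] > max_byte) max_byte = s[i];
    }
    if (max_byte < 0x80) return 0x7F;   /* ASCII only */
    if (max_byte < 0xC4) return 0xFF;   /* lead bytes 0xC2..0xC3 at most */
    if (max_byte < 0xF0) return 0xFFFF; /* up to 3-byte sequences */
    return 0x10FFFF;                    /* a 4-byte sequence is present */
}
```

A running byte maximum is the kind of reduction that maps naturally onto vector instructions, which is why this formulation may be a better fit for SIMD than per-sequence decoding.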

ref:
cJSON, ujson, and Python's standard json implementation don't use SIMD.
orjson also doesn't use SIMD; it uses a Rust function, create.rs#L15.
I can't say what method pysimdjson uses.

EDIT:
I was trying to use count_utf8(buf) == latin1_length_from_utf8(buf) to check whether it is Latin1, but I don't know if that is costly.
