Comments (13)

lemire commented on June 15, 2024

Ok. So this makes sense. We will add this functionality.

from simdutf.

lemire commented on June 15, 2024

It is dead code. Let us remove it.

Nambers commented on June 15, 2024

btw is there a plan for adding support to validate_latin1?

lemire commented on June 15, 2024

> btw is there a plan for adding support to validate_latin1?

It is not logically possible. Any sequence of bytes is valid latin1.

Nambers commented on June 15, 2024

> It is not logically possible. Any sequence of bytes is valid latin1.

Oh, what I mean is a check that is logically equivalent to: every code point in this UTF-8/16/32 (etc.) encoded sequence has a value no greater than 0xFF. Is that the definition of Latin-1, or am I just misunderstanding it?

lemire commented on June 15, 2024

Can you describe your use case?

Nambers commented on June 15, 2024

> Can you describe your use case?

E.g., given a UTF-8 sequence:

  • 1-byte UTF-8 -> true
  • 2-byte UTF-8 -> check whether its code point is <= 0xFF by decoding ((a & 0b00011111) << 6 | (b & 0b00111111))
  • 3/4-byte UTF-8 -> false

I guess it is more like validate_latin1_from_utf8 for this case? I don't know much about the other encodings.
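
The three cases above can be sketched in scalar C as follows. This is only an illustration: `utf8_fits_latin1` is a hypothetical name, not a simdutf function, and the sketch assumes the input has already been validated as UTF-8.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of simdutf): returns true when every code
 * point in the buffer is <= 0xFF. Assumes the input is valid UTF-8. */
bool utf8_fits_latin1(const char *buf, size_t len) {
    const uint8_t *p = (const uint8_t *)buf;
    size_t i = 0;
    while (i < len) {
        uint8_t a = p[i];
        if (a < 0x80) {
            /* 1-byte sequence: ASCII, always fits */
            i += 1;
        } else if ((a & 0xE0) == 0xC0) {
            /* 2-byte sequence: decode and compare against 0xFF */
            if (i + 1 >= len) return false; /* truncated input */
            uint8_t b = p[i + 1];
            uint32_t cp = ((uint32_t)(a & 0x1F) << 6) | (uint32_t)(b & 0x3F);
            if (cp > 0xFF) return false;
            i += 2;
        } else {
            /* 3- or 4-byte sequence: code point is above 0xFF */
            return false;
        }
    }
    return true;
}
```

For example, the two bytes 0xC3 0xBF (U+00FF) pass, while 0xC4 0x80 (U+0100) fail.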

lemire commented on June 15, 2024

I understand the functionality, but what is your application?

We try to focus on providing functionality that is useful in practice. So I want to understand the motivation.

Nambers commented on June 15, 2024

For me, it is mostly because I need to transfer arbitrary UTF-8 sequences stored in a char* into a Python Unicode object from C, which requires providing the maximum code point value (0x7f for ASCII, 0xff for Latin-1, 0xffff, or 0x10ffff). Plus, it could be used for the check in latin1_length_from_utf8?

But I also admit that I don't know how much SIMD can improve this process, since it seems straightforward: make a 0b1000_0000 mm128/256 mask, cmplt_mask every batch, and run a for loop over every batch whose mask is not equal to 0.
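
A portable way to picture that batch-and-fallback idea, without committing to any particular intrinsics, is a word-at-a-time (SWAR) version: it tests the high bit of 8 bytes with one mask and only inspects bytes individually when the mask is nonzero. This is a stand-in for the SIMD approach, not the SIMD code itself; `utf8_fits_latin1_batched` is a hypothetical name, and the sketch assumes valid UTF-8 input.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of simdutf): true when every code point is
 * <= 0xFF. All-ASCII 8-byte batches are skipped with a single mask test.
 * Assumes the input is valid UTF-8. */
bool utf8_fits_latin1_batched(const char *buf, size_t len) {
    const uint8_t *p = (const uint8_t *)buf;
    size_t i = 0;
    while (i < len) {
        if (len - i >= 8) {
            uint64_t w;
            memcpy(&w, p + i, 8); /* safe unaligned load */
            if ((w & UINT64_C(0x8080808080808080)) == 0) {
                i += 8; /* no high bit set: the whole batch is ASCII */
                continue;
            }
        }
        /* Slow path: consume one code point, then retry the batch test. */
        uint8_t a = p[i];
        if (a < 0x80) {
            i += 1;
        } else if (a == 0xC2 || a == 0xC3) {
            /* Lead bytes 0xC2 and 0xC3 encode U+0080..U+00FF */
            i += 2;
        } else {
            return false; /* any other lead byte encodes a value above 0xFF */
        }
    }
    return true;
}
```

The slow path relies on the fact that, in valid UTF-8, the only lead bytes that produce code points in U+0080..U+00FF are 0xC2 and 0xC3, so no decoding is needed.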

lemire commented on June 15, 2024

> For me, it is mostly because I need to transfer arbitrary UTF-8 sequences stored in a char* into a Python Unicode object from C, which requires providing the maximum code point value

Can you point at the API you are trying to support?

Is that part of the standard Python API?

Nambers commented on June 15, 2024

> Can you point at the API you are trying to support?
>
> Is that part of the standard Python API?

It is PyUnicode_New. I don't want to use PyUnicode_FromString since it requires an additional buffer, which may be costly when the length is large. So I will use PyUnicode_New to create the buffer and write into it directly, which requires me to determine the kind of the PyUnicode object, as I said.

lemire commented on June 15, 2024

@Nambers Can you give me a full code routine? How would you use it efficiently afterward? Have you run benchmarks?

I understand here that the idea is that we could take UTF-8 strings in C or C++ and move them to Python at high speed. If simdutf can provide functionality to help with that, we will, for sure, add it.

But I'd like to make sure that it is worth our time to do it.

cc @TkTech

Nambers commented on June 15, 2024

> I understand here that the idea is that we could take UTF-8 strings in C or C++ and move them to Python at high speed. If simdutf can provide functionality to help with that, we will, for sure, add it.

Sounds great. So basically I need a function that uses SIMD to get the max char value, or just to detect whether the input is Latin-1 (<= 0xFF) or 3-byte UTF-8 (<= 0xFFFF).
Here is my implementation so far: str.c#L11 (I also need to handle Unicode escapes in it, but I guess that is not something simdutf needs to consider).
Then, after determining the kind, we just take some fast paths, like the code that follows in my file.
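
One way to sketch the "max char value" function, under the assumption that the input is already valid UTF-8, is to take the maximum byte of the buffer: the largest lead byte determines which of the four buckets (0x7F, 0xFF, 0xFFFF, 0x10FFFF) the string falls into. `utf8_maxchar_bound` is a hypothetical name, not a simdutf or CPython function; its result is meant as the rounded-up maxchar value that PyUnicode_New takes as its second argument.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of simdutf): returns 0x7F, 0xFF, 0xFFFF or
 * 0x10FFFF, an upper bound on the largest code point in the buffer.
 * Assumes the input is valid UTF-8. */
uint32_t utf8_maxchar_bound(const char *buf, size_t len) {
    const uint8_t *p = (const uint8_t *)buf;
    uint8_t max_byte = 0;
    /* Plain max reduction over the bytes; no decoding is needed. */
    for (size_t i = 0; i < len; i++) {
        if (p[i] > max_byte) max_byte = p[i];
    }
    if (max_byte < 0x80) return 0x7F;     /* ASCII only */
    if (max_byte <= 0xC3) return 0xFF;    /* lead bytes 0xC2/0xC3: U+0080..U+00FF */
    if (max_byte <= 0xEF) return 0xFFFF;  /* other 2-byte and 3-byte lead bytes */
    return 0x10FFFF;                      /* 4-byte lead bytes 0xF0..0xF4 */
}
```

This works because continuation bytes (0x80..0xBF) are always smaller than any multi-byte lead byte, so the maximum byte is a lead byte whenever the string is not pure ASCII. The loop is a simple reduction, which is the kind of operation a byte-wise SIMD max could replace.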

ref:
cjson, ujson, and the Python standard json implementation don't use SIMD.
orjson also doesn't use SIMD; it uses a Rust function, create.rs#L15.
I can't say what method pysimdjson uses.

EDIT:
I was trying to use count_utf8(buf) == latin1_length_from_utf8(buf) to check whether it is Latin-1, but I don't know if it is costly.
