Giter Site home page Giter Site logo

Add Monotonicity about ulid HOT 17 CLOSED

sergeyprokhorenko avatar sergeyprokhorenko commented on May 12, 2024
Add Monotonicity

from ulid.

Comments (17)

sergeyprokhorenko avatar sergeyprokhorenko commented on May 12, 2024 2

To solve these problems, I would advise to calculate monotonic ULID as

ttttttttttsssrrrrrrrrrrrrr
(string representation in Crockford's base32)

where
t is Timestamp (10 characters), UNIX-time in milliseconds (UTC)
s is Sequence (3 characters), generated for each ULID with the same database and Timestamp
r is Randomness (13 characters), generated in advance by true random number generator, separately for each ULID

from ulid.

neg3ntropy avatar neg3ntropy commented on May 12, 2024 2

I hope I am not being irritating, but this is an interesting discussion, so I will say one more thing.

The gap in the space of possible ULID is a non issue. ULID are already very sparse and use 128 bits, while a db sequence would very easily work with 32 bits, 64 at most. Lots of possible ULIDs will never be used just because you don't create objects every millisecond, and even if you would, the randomness would still make the space very sparse.

The bits used for microsecond precision would have the same efficiency as the random bits, as more or less they are random. Also they would need to be 10 bits as there are 1000 microseconds to a millisecond, but we don't need to worry that we are eating into the randomness, because we are still reducing the chance of collision with those bits.

We accept the sparse nature of ULIDs (and UUIDs), because it's distributed id generation: we don't need to hit a database to get one. Everything can create a ULID and expect no collision.

This is also why I would not go down the sequence road: it assumes you have shared memory between consumers, but that could very well not be the case with multiple processes. In python we do use processes more than threads because of the GIL.

In conclusion, I think that microseconds would work better in distributed environments, while being easier to implement and faster. The sequence approach is valid when you don't have microsecond precision available.

from ulid.

ahawker avatar ahawker commented on May 12, 2024 1

Monotonicity wasn't actually part of the spec when I wrote this library.

I'll take a deeper look into the details.

https://github.com/ulid/spec#monotonicity

from ulid.

neg3ntropy avatar neg3ntropy commented on May 12, 2024 1

Just an idea, sorry if it's stupid, but since time sources can have microsecond precision, how about taking anything under the millisecond as the most significant random bits? Wouldn't it improve monotonicity by quite a lot in practice and still be very efficient and simple?

from ulid.

sergeyprokhorenko avatar sergeyprokhorenko commented on May 12, 2024 1

Taking into account that generating a ULID on the fly takes more than a microsecond, you should in advance generate 32768 ULIDs with Sequence to the buffer for every millisecond and use them from the buffer including at peak load.

But your proposed use of a microsecond counter as a sequence will result in gaps in the sequence of microseconds recorded to ULIDs. That is, the three characters allocated for microseconds will be used less efficiently than the same three characters allocated for the Sequence. Fewer ULIDs will be generated than possible. And besides, thousands of new record will have to wait for generating new ULIDs on the fly: one ULID per more than a microsecond.

I think the best place fo ULID generator is DBMS, not Python. But unfortunately DBMS developers have the wrong priorities.

from ulid.

sergeyprokhorenko avatar sergeyprokhorenko commented on May 12, 2024

Andrew,
You'd better take the Monotonicity from the popular implementation of ULID in Golang. It's much more secure, because it's impossible to find a valid ID by increment or decrement of the known ID.

Sergey

from ulid.

ahawker avatar ahawker commented on May 12, 2024

Based on my reading, you are limited to a random number of values you can generate within the same millisecond. If your first randomness value is sufficiently high, you can only generate within that limited window until you overflow.

The implementation in https://github.com/oklog/ulid appears to do this, so it's really best effort but unlikely you'll be able to create 2^80 values within the same millisecond.

Further reading lead me to ulid/spec#11 which brings up the issue/discrepancy.

If we implement a best effort approach, it would require the following:

Changes to the following to use an entropy source instead of os.urandom directly:

  • api.new
  • api.from_timestamp

The big addition here is going to be a stateful entropy source. The concurrency guarantees of this entropy source are still unknown and don't appear to be defined within the spec. The default JS implementation does not appear to be thread safe. Investigation of other implementations is necessary to see what guarantees other libraries are attempting to make. Based on this investigation, I can imagine this entropy source being a global instance, thread safe via locks, and created at import time
when there is less likely a change for race condition. The other implementation could be using a thread local in which the entropy source would maintain the monotonicity requirements per-thread but not per-process. This appears to be what the default implementation does but I'm not really a fan of this approach.

Next Steps:

  • - Investigate other implementations to see (or lacking) concurrency guarantees.

from ulid.

vincent-grosbois avatar vincent-grosbois commented on May 12, 2024

Would it be possible to just loop to create ULIDs until the new ULIDs is bigger than the previous one?
Then it would always be monotonic... at worst you would loop during 1ms until the initial part is incremented

from ulid.

sergeyprokhorenko avatar sergeyprokhorenko commented on May 12, 2024

Just look at this implemetation of guaranteed multiple increasing ULIDs per millisecond: ULID with sequence

from ulid.

sergeyprokhorenko avatar sergeyprokhorenko commented on May 12, 2024

Hi Andrea,
Microseconds are better than milliseconds. However microseconds, unlike Sequence, will not solve the problem of short-term peak loads.
For example, during one microsecond thousands of records may appear for which we need to generate ULIDs, but before and after this microsecond the flow of new records is much weaker. Adding microseconds will not solve this problem, but Sequence will solve it.

from ulid.

neg3ntropy avatar neg3ntropy commented on May 12, 2024

I guess generating a ULID takes more than a microsecond even on a fast CPU, so that covers the single thread case.

In [1]: from time import time                                                                                      

In [2]: a,b=(time(),time())                                                                                        

In [3]: a                                                                                                          
Out[3]: 1593030169.206382

In [4]: b                                                                                                          
Out[4]: 1593030169.2063825

When you have multiple threads, already the order of operations is indeterminate to a point, so probably nobody would care. Furthermore that in python such a case is even possible is also to be demonstrated.

So microsecond precision does not solve the issue in theory but I would say the case where it does ruin somebody's day, just does not exist.

from ulid.

ahawker avatar ahawker commented on May 12, 2024

@neg3ntropy I had come back to this issue a week ago or so and actually started researching the same thing; definitely not a stupid idea! It's another approach to trading a bit of entropy for a value that is guaranteed to be incrementing, in this case, the system clock. (Ignoring the fact that the system clock can be changed, thus breaking any montonic property of time.time).

The reasons I have not (yet) tried implemented it:

  1. It's against the ULID spec. This is a concern, albeit, a small one as we're not getting much/any guidance from the spec author.
  2. It's more prone to the whims of the platform the code is running on. Windows was notorious for only have 16ms accuracy on their clocks. This PEP 564 annex outlines some of the observed values by platform. If these numbers are accurate, I believe there would be some cases where trading randomness bits for timestamp bits (that don't change fast enough because of clock frequency) would mean you might have more chances for collision/out of order values.

Neither of these issues are true blockers, I just wanted to post where my current research and thinking was.

from ulid.

sergeyprokhorenko avatar sergeyprokhorenko commented on May 12, 2024

Hi Andrea,

I don't care of the gap between two sequential ULIDs in the space of possible ULIDs at all. I mean exactly the gap in the sequence of microseconds recorded to ULIDs. These two kinds of gap are very different things.

Imagine you are near the end of a millisecond, and suddenly peak load occured. You don't have enough remaining microseconds in this millisecond to generate sufficient number of ULIDs. But you have enough variants of the Sequence in pre-generated ULIDs for this millisecond.

Ordered IDs are important for databases only to shorten the search time. ULIDs with Sequence are intended for generation within database, and ULIDs with sequence are better than microseconds in this case (see previous paragraph). If you merge records with ULIDs with Sequence from several databases, you should sort the records within every millisecond in the buffer or generate own ULIDs within the target database. The sorting would be easier with microseconds (and the best available time resolution is 100 ns) instead of milliseconds+Sequence, but lack of Sequence may cause problems in source databases. If the sources are not databases, and you don't need quick search in them, than microseconds (without Sequence) would be enough.

from ulid.

ahawker avatar ahawker commented on May 12, 2024

Monotonic implementation was added in #473 with the goal of implementing it "as defined by the spec". As discussed here and in other GH issues/threads, there are some problems/questions with that spec. However, the goal at this point should be interoperability with other ULID implementations with future effort put into proposing changes to the spec.

I'll try and get a new version on pypi today with these additions.

from ulid.

neg3ntropy avatar neg3ntropy commented on May 12, 2024

I have reviewed providers/monotonic.py only. I still would have favoured a solution with timestamp extra bits as randomness, maybe only platform where such precision is available, but no thread locks and state. I think that would be better for performance as well as ordering in distributed/multiprocessing environments.

However I am happy for the development and this already covers my use case well enough.

from ulid.

ahawker avatar ahawker commented on May 12, 2024

I took into account the discussions in this thread during design and I believe implementing a microsecond based ULID would be straight forward with a new provider implementation. I'll probably take a swing at it one of these weekends to see if that assumption is correct.

from ulid.

ahawker avatar ahawker commented on May 12, 2024

Implementation using microseconds is in #476

from ulid.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.