uuid6 / uuid6-ietf-draft
Next Generation UUID Formats
Home Page: https://datatracker.ietf.org/doc/draft-peabody-dispatch-new-uuid-format/
@bradleypeabody, @kyzer-davis,
This is a continuation of my comment on PR #10
I thought of another approach. Instead of treating the integer part and the fractional part separately, the entire timestamp can be encoded in a fixed-point number.
I imagined another structure that reserves the last 32 bits for random bits or application-specific data. I'm not sure if anyone will need more than 90 bits for the timestamp. A field with 38 bits for the integer part and 52 bits for the fractional part is enough to store a decimal number with 15 digits after the decimal point, like this: 0.123456789012345.
The structure described below also works with 48 bits reserved for milliseconds, requiring very small changes. By the way, I think that 48 bits for milliseconds may be more convenient in contexts where the best precision you can get is milliseconds, as in Java 8, because you don't need to encode the sub-millisecond part.
Sorry for sending spam to your project. I am very happy with the progress of this RFC update. I'm just trying to contribute some ideas, and I don't expect them to be accepted.
This proposed structure consists of two fields: `unixtime` and `random`. The only mandatory requirements are to fill in the first 38 bits (48 bits for ms is better?) of the `unixtime` field with the current seconds and to fill in the 6 bits reserved for version and variant with their standardized values. The number of sub-second bits in the `unixtime` field is variable/arbitrary. And the `random` field does not have to be filled with random bits. The name of this field is just a hint that its value by default is random.
`unixtime`: a 90-bit field reserved for the Unix timestamp. This field is encoded as a fixed-point number in the form `fixed<38,n>`, where 38 is the number of bits before the binary point, and `n` is the number of bits after the binary point. The size `n` is arbitrary so that it can be defined by the system that generates the UUID. The bits not used in this field can be filled with a sequence counter and/or some random bits, for example. The timestamp can represent dates up to the year 10680 A.D. (2^38/60/60/24/365.25 + 1970).

`random`: a 32-bit field reserved for random bits or application-specific data. Part or all of this field can be used, for example, to identify the UUID generator. This field is opaque, so its bits have no meaning, except for the system that generates the UUID.
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| unixtime |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| unixtime | ver | unixtime |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|var| unixtime |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| random |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Say you have a fractional value between 0 and 1 represented as `f`; you can encode that fractional value with a precision of `n` bits like this:

subsec = f * 2^n

To decode a `subsec` with `n` bits of precision you do the inverse operation:

f = subsec / 2^n
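A minimal Python sketch of this round trip (the function names are mine, for illustration only):

```python
def encode_subsec(f: float, n: int) -> int:
    """Encode a fraction 0 <= f < 1 as an n-bit fixed-point integer."""
    return round(f * (2 ** n))

def decode_subsec(subsec: int, n: int) -> float:
    """Decode an n-bit fixed-point integer back into a fraction."""
    return subsec / (2 ** n)

# The round trip recovers the fraction to within one step of 2^-n:
f = 0.943017418
n = 30
assert abs(decode_subsec(encode_subsec(f, n), n) - f) < 2 ** -n
```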
This tutorial describes an efficient method for converting floating-point to fixed-point in MATLAB: Fixed-Point.pdf.
To encode the sub-second part:

- Multiply the fractional part by 2^n, where `n` is the number of bits that fits the system time precision. If the precision is millisecond, `n` is equal to 10 (2^10 = 1,024). If the precision is microsecond, `n` is equal to 20 (2^20 = 1,048,576), and so on;
- Fill `n` bits of the UUID with the `n` bits of the previous multiplication product.

To decode the sub-second part:

- Extract the `n` bits starting from the 38th bit position of the UUID, where `n` is the number of bits that fits the system time precision;
- Divide the extracted value by 2^n to get a fractional value.

To obtain a floating-point representation of the timestamp you can sum the two parts.
Yesterday I implemented this simple python generator to test the concept:
#!/usr/bin/python3
#
# @date: 2021-03-06
# @author: Fabio Lima
#
# Changelog:
# - 2021-03-16: add parameter `subsec_decimal_digits` to function `uuid7()`
# - 2021-03-08: use `time.time_ns()` instead of `time.time()` to fix nanosecond precision
#
from uuid import UUID
from time import time_ns
from random import randint
TOTAL_BITS=128
VERSION_BITS = 4
VARIANT_BITS = 2
# Binary digits before the binary point
SEC_BITS=38
# Binary digits after the binary point
SUBSEC_BITS_S=0
SUBSEC_BITS_MS=10
SUBSEC_BITS_US=20
SUBSEC_BITS_NS=30
SUBSEC_BITS_DEFAULT=SUBSEC_BITS_NS
# Decimal digits after the decimal point
SUBSEC_DECIMAL_DIGITS_S=0 # 0
SUBSEC_DECIMAL_DIGITS_MS=3 # 0.999
SUBSEC_DECIMAL_DIGITS_US=6 # 0.999999
SUBSEC_DECIMAL_DIGITS_NS=9 # 0.999999999
SUBSEC_DECIMAL_DIGITS_DEFAULT=SUBSEC_DECIMAL_DIGITS_NS
SLICE_MASK_0 = 0xffffffffffff00000000000000000000
SLICE_MASK_1 = 0x0000000000000fff0000000000000000
SLICE_MASK_2 = 0x00000000000000003fffffffffffffff
def uuid7(t=None, subsec_bits=SUBSEC_BITS_DEFAULT, subsec_decimal_digits=SUBSEC_DECIMAL_DIGITS_DEFAULT):
    if t is None:
        t = time_ns()
    i = get_integer_part(t)
    f = get_fractional_part(t, subsec_decimal_digits)
    sec = i
    subsec = round(f * (2 ** subsec_bits))
    node_bits = (TOTAL_BITS - VERSION_BITS - VARIANT_BITS - SEC_BITS - subsec_bits)
    uuid_sec = sec << (subsec_bits + node_bits)
    uuid_subsec = subsec << node_bits
    uuid_node = randint(0, (2 ** node_bits) - 1)  # randint() is inclusive on both ends
    uuid_int = uuid_sec | uuid_subsec | uuid_node  # 122 bits
    uuid_int = __add_version__(uuid_int)  # 128 bits
    return UUID(int=uuid_int)

def uuid7_s(t=None):
    return uuid7(t, SUBSEC_BITS_S, SUBSEC_DECIMAL_DIGITS_S)

def uuid7_ms(t=None):
    return uuid7(t, SUBSEC_BITS_MS, SUBSEC_DECIMAL_DIGITS_MS)

def uuid7_us(t=None):
    return uuid7(t, SUBSEC_BITS_US, SUBSEC_DECIMAL_DIGITS_US)

def uuid7_ns(t=None):
    return uuid7(t, SUBSEC_BITS_NS, SUBSEC_DECIMAL_DIGITS_NS)

def __add_version__(uuid_int):
    # Spread the 122-bit value across three slices to make room for the
    # version and variant bits
    slice_mask_0 = SLICE_MASK_0 >> (VERSION_BITS + VARIANT_BITS)
    slice_mask_1 = SLICE_MASK_1 >> (VARIANT_BITS)
    slice_mask_2 = SLICE_MASK_2
    slice_0 = (uuid_int & slice_mask_0) << (VERSION_BITS + VARIANT_BITS)
    slice_1 = (uuid_int & slice_mask_1) << (VARIANT_BITS)
    slice_2 = (uuid_int & slice_mask_2)
    uuid_output = slice_0 | slice_1 | slice_2
    uuid_output = uuid_output & 0xffffffffffff0fff3fffffffffffffff  # clear version and variant bits
    uuid_output = uuid_output | 0x00000000000070008000000000000000  # apply version 7 and variant 10
    return uuid_output

def __rem_version__(uuid_int):
    # Inverse of __add_version__: squeeze the version and variant bits back
    # out, leaving the 122-bit payload
    slice_0 = (uuid_int & SLICE_MASK_0) >> (VERSION_BITS + VARIANT_BITS)
    slice_1 = (uuid_int & SLICE_MASK_1) >> (VARIANT_BITS)
    slice_2 = (uuid_int & SLICE_MASK_2)
    uuid_output = slice_0 | slice_1 | slice_2
    return uuid_output

def get_integer_part(t):
    SUBSEC_DECIMAL_DIGITS_PYTHON = 9  # time_ns() returns nanoseconds
    subsec_decimal_divisor = (10 ** SUBSEC_DECIMAL_DIGITS_PYTHON)
    return t // subsec_decimal_divisor  # integer division avoids float rounding

def get_fractional_part(t, subsec_decimal_digits=SUBSEC_DECIMAL_DIGITS_DEFAULT):
    SUBSEC_DECIMAL_DIGITS_PYTHON = 9
    subsec_decimal_divisor = (10 ** SUBSEC_DECIMAL_DIGITS_PYTHON)
    return round((t % subsec_decimal_divisor) / subsec_decimal_divisor, subsec_decimal_digits)

def extract_sec(uuid):
    uuid_int = __rem_version__(uuid.int)
    uuid_sec = uuid_int >> (TOTAL_BITS - VERSION_BITS - VARIANT_BITS - SEC_BITS)
    return uuid_sec

def extract_subsec(uuid, subsec_bits=SUBSEC_BITS_DEFAULT, subsec_decimal_digits=SUBSEC_DECIMAL_DIGITS_DEFAULT):
    uuid_int = __rem_version__(uuid.int)
    node_bits = (TOTAL_BITS - VERSION_BITS - VARIANT_BITS - SEC_BITS - subsec_bits)
    uuid_subsec = (uuid_int >> node_bits) & ((1 << subsec_bits) - 1)
    return round(uuid_subsec / (2 ** subsec_bits), subsec_decimal_digits)

def demo():  # renamed from `list` to avoid shadowing the builtin
    print("UUIDv7 sec in sec out subsec in subsec out")
    for _ in range(10):
        t = time_ns()
        u = uuid7(t)
        i = get_integer_part(t)
        f = get_fractional_part(t)
        sec = extract_sec(u)
        subsec = extract_subsec(u)
        print(u, str(i).ljust(12), str(sec).ljust(12), str(f).ljust(12), str(subsec).ljust(12))

demo()
OUTPUT:
UUIDv7 sec in sec out subsec in subsec out
018145e0-5fc5-7a65-ae82-507fbfe22749 1615951895 1615951895 0.943017418 0.943017418
018145e0-5fc5-7bad-b660-9ddd65978a4c 1615951895 1615951895 0.943095648 0.943095648
018145e0-5fc5-7c84-a49c-a9797ed70dc4 1615951895 1615951895 0.943146842 0.943146842
018145e0-5fc5-7d46-b435-4c76da9142c5 1615951895 1615951895 0.943193153 0.943193153
018145e0-5fc5-7de2-8716-009c3130d332 1615951895 1615951895 0.943230178 0.943230178
018145e0-5fc5-7e74-9231-05a35740fba0 1615951895 1615951895 0.943265028 0.943265028
018145e0-5fc5-7f4a-8022-e1ba32698bc3 1615951895 1615951895 0.943315983 0.943315983
018145e0-5fc5-7fd5-a568-b1d7bc223a91 1615951895 1615951895 0.943349262 0.943349262
018145e0-5fc6-7062-8b99-ba3f383d9b8e 1615951895 1615951895 0.943382783 0.943382783
018145e0-5fc6-70ed-98ce-fa5bf04836d8 1615951895 1615951895 0.943415972 0.943415972
The first version of the generator above used `time.time()`. This method appears to return floating-point numbers with 22 bits after the binary point. I had to replace it with `time.time_ns()` to fix the nanosecond precision. If you need nanosecond precision, don't use `time.time()`.
This loop prints the sub-seconds returned by `time.time()` in binary format:
from time import time

for _ in range(10):
    n = 30              # system time precision
    t = time()          # get the current time
    i = int(t)          # integer part
    f = t - int(t)      # fractional part
    # encode the fractional part to an integer
    subsec = int(f * (2 ** n))
    # print the result in binary format
    print(bin(subsec))
This is the output of the previous loop:
|---------30 bits-----------|
0b10011010010000101110100000000
0b10011010010010000100100000000
0b10011010010010100100000000000
0b10011010010011010001000000000
0b10011010010011111000100000000
0b10011010010100001100100000000
0b10011010010100011111000000000
0b10011010010100110001100000000
0b10011010010101000011100000000
0b10011010010101010110000000000
|--------22 bits----|-empty-|
Note that the last 8 bits are zeros.
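One way to verify this on any system is to count the trailing zero bits of the encoded fraction directly (the helper below is my own, for illustration; results will vary by platform):

```python
from time import time

def trailing_zero_bits(x: int) -> int:
    """Number of trailing zero bits in a positive integer."""
    return (x & -x).bit_length() - 1

n = 30
subsec = int((time() % 1) * (2 ** n))
# A clock with only ~22 significant fractional bits leaves roughly
# 8 trailing zeros here on every call.
print(trailing_zero_bits(subsec))
```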
`node` was originally defined as "MAC address or a random value". Then in this spec we've decided to be explicit about the fact it should not be a MAC address. Thus, the implication is that `node` MUST be a random value.

Meanwhile the definition of the `subsec_seq_node` field makes no mention of "node", other than in the name:
The remaining 62 bits which MAY be allocated to any combination of additional sub-second precision, sequence counter, or pseudo-random data.
(But it's at least explicit about the field holding subsecond time, sequence, or random data)
However in section 4.4.3 we define "node" as "the leftover bits" that aren't allocated to subsecond or sequence values, then rather tepidly state, "In most scenarios the node SHOULD be filled with pseudo-random data".
That opens the door for it being something other than random values, but no mention is made of what else might be used there, or how that would be possible without falling afoul of the implied (legacy) definition.
Basically, its use is confusing and adds little value. We should just be explicit about the fact the "leftover bits" should be set to pseudo-random values and avoid the term altogether unless we're talking about the legacy definition (in version 6).
Draft 02, Section 5.2. Monotonicity and Counters
Old:

The following methods SHOULD be employed for single-node UUID implementations that do require batch UUID creation.

Fixed-Length Dedicated Counter Bits (Position A):
Fixed-Length Dedicated Counter Bits (Position B):
Monotonic Random:

New:

Implementations SHOULD choose one method for single-node UUID implementations that require batch UUID creation.

Fixed-Length Dedicated Counter Bits (Method 1):
Fixed-Length Dedicated Counter Bits (Method 2):
Monotonic Random (Method 3):
Current text reads as if all methods should be leveraged when in reality there are two methods and some supporting text about the fixed-length methods.
Hello,
I was on the fence as to whether I should include this in #24, or file it as its own issue. If that other issue is better, let me know and I'll add a comment there instead.
I am working on an implementation of the latest draft in Go, and when using millisecond precision at full bore I can easily increment the clock sequence to the point of rolling over the 12-bit integer. When discovering this I wanted to understand how the rollover should be handled, and I noticed some issues with how section 4.4.2 is worded that contradict themselves and ultimately don't provide explicit direction on how to handle the rollover.
4.4.2. UUIDv7 Clock Sequence Usage
UUIDv7 SHOULD utilize a monotonic sequence counter to provide
additional sequencing guarantees when multiple UUIDv7 values are
created in the same UNIXTS and SUBSEC timestamp. The amount of bits
allocates to the sequence counter depend on the precision of the
timestamp. For example, a more accurate timestamp source using
nanosecond precision will require less clock sequence bits than a
timestamp source utilizing seconds for precision. For best
sequencing results the sequence counter SHOULD be placed immediately
after available sub-second bits.
The clock sequence MUST start at zero and increment monotonically for
each new UUIDv7 created on by the application on the same timestamp.
When the timestamp increments the clock sequence MUST be reset to
zero. The clock sequence MUST NOT rollover or reset to zero unless
the timestamp has incremented. Care MUST be given to ensure that an
adequate sized clock sequence is selected for a given application
based on expected timestamp precision and expected UUIDv7 generation
rates.
The initial issue is that the first and second paragraphs contradict each other. The first says we SHOULD utilize a monotonic sequence counter, without offering guidance on how to implement it without using a sequence counter. The second paragraph then goes on to say "The clock sequence MUST...", which, with the usage of "the", indicates that there is only one type of clock sequence that should be used. This makes me think the SHOULD in the first paragraph might benefit from becoming a MUST.
The other issue I ran into is that it's unclear how to proceed when the sequence counter would roll over without the timestamp having changed. I understand that I should not permit it to roll over, but I don't know whether that's a failure condition and I should return an error, or whether I should just generate UUIDv7s with the maximum sequence number until the timestamp changes. Would there be value in being explicit about how to handle that failure mode?
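For what it's worth, here is a sketch (my own, not from the draft) of the "freeze and wait for the next tick" interpretation, with a 12-bit counter at millisecond precision:

```python
import time

SEQ_BITS = 12
SEQ_MAX = (1 << SEQ_BITS) - 1  # 4095 ids per millisecond

_last_ms = 0
_seq = 0

def next_ms_and_seq():
    """Return a (millisecond timestamp, sequence) pair that is strictly
    monotonic; spins when the 12-bit counter would roll over."""
    global _last_ms, _seq
    while True:
        now_ms = max(time.time_ns() // 1_000_000, _last_ms)
        if now_ms > _last_ms:
            _last_ms, _seq = now_ms, 0  # timestamp advanced: reset counter
            return now_ms, 0
        if _seq < SEQ_MAX:
            _seq += 1
            return now_ms, _seq
        # Counter exhausted within this millisecond. The draft doesn't say
        # what to do here; raising an error or saturating at SEQ_MAX are
        # the other options. This sketch waits for the clock to tick.
        time.sleep(0)
```

The three behaviors at the exhaustion point (error, saturate, wait) produce very different guarantees, which is exactly why explicit direction in the spec would help.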
The truth is, I believe the majority of organizations would lean on publicly available UUID implementations over implementing their own. To me that means ambiguities are going to result in implementation differences between languages and libraries within single languages, which could be problematic in a polyglot environment if the different implementations have their own interpretations of the RFC.
Separately from the wording in the RFC, is there something we can do about the clock sequence hitting the maximum value when using millisecond precision? Seeing how easy it was to hit the maximum clock sequence value, it seems to strongly devalue the millisecond precision variant of UUIDv7 to the point I'd probably actively discourage its use in favor of the microsecond precision.
Would love to hear your thoughts on both of these.
Playing Devil's Advocate: One of my first impressions of this proposal is that introducing three new versions seems "excessive".
Collectively, these three versions introduce a number of much-needed improvements. However I'm not entirely clear on what purpose the multiple versions serve. 'Seems like this is unnecessarily complicating the development of this spec.
For example, I would suggest the spec for version 7 provides the most value at present. It supports Unix timestamps, db-friendly byte-ordering, arbitrary sub-second resolution, and prohibits the use of MAC addresses, all of which are great. I.e. This is pretty obviously what most applications that need timestamp UUIDs should be using moving forward.
Meanwhile version 6 has nothing to recommend it over version 1 other than the reordering of the `time_*` fields. I assume the nominal use case for this is to make it trivial to migrate version 1 UUIDs to a more db-friendly form? But I don't think that actually works, because v6 currently prohibits the use of MAC addresses in the `node` field. Thus, converting from v1 -> v6 will be necessarily non-deterministic. (The `node` field will have to be set randomly in order to remain in compliance with this spec, since there's no way of knowing for sure that the v1 `node` is not a MAC address.)
For anyone wanting to convert v1 uuids to v6 across a large, distributed system, that's going to be a huge issue.
As for version 8... well, I confess I don't really understand the use case for this yet. My first impression is that it's so abstract as to be almost meaningless. Technically, you could have a single timestamp bit ("The Boolean Epoch"?!?) with 121 bits of random data and still conform to the spec, no? I suppose (???) there's some value in allowing for arbitrary timestamp epoch/resolution/random bit construction of uuids but a spec that starts with, "The only explicitly defined bits are the Version and Variant leaving 122 bits for implementation specific time-based UUIDs" doesn't seem like much of a spec. It's more just a hand-wavy catchall that's going to invite people to create ad-hoc, proprietary sub-versions.
"It's a version 8 UUID, but we [Blargle Fargle Inc.] set bits 73-78 to b110011
for our PulsyWulsy™ blood oximeter product."
See also #29.
Need to fix still,
4.4. UUIDv7 Layout and Bit Order
Old:
The format for the 16-octet, 128-bit UUIDv6 is shown in Figure 2
New:
The format for the 16-octet, 128 bit UUIDv7 is shown in Figure 2
Reported by Blachly, James [email protected] on the mailer
Already fixed, under #14, found by Ville Skyttä [email protected] on the mailer
Old:
The 4 bit UUIDv8 version (0111)
New:
The 4 bit UUIDv7 version (0111)
There are a lot of different individual concerns along with ideas on how things could be solved. It would be valuable to have a concise list of these along with the current decision on what the outcome is as expressed in the current draft. It should summarize the concern/question, concisely list out the alternatives and/or the pros and cons, and the decision as currently expressed in the draft. This should make discussions about specific features very focused and avoid having to repeat the same concerns over and over. It also means that good ideas and valid concerns that are still being considered have a place where they can be recorded, summarized, and made easily accessible.

The README for this repo is probably the appropriate place. It's visible, and people can submit PRs to add things (which I will readily merge as long as they concisely include data and pros and cons, and leave out the "therefore MUST do something this way"; what we need instead is "the concerns are A, B, C; solution X solves A and B but not C; solution Y solves all three but introduces problem D"). If all of the data can be concisely collected, the ensuing discussions will be far more productive, and then the exact reason for changes can be documented as well.
Examples of things to cover:
There are a bunch of these and they have exact arguments for various solutions, and only a subset of that information is covered in the draft.
Related: bradleypeabody/gouuidv6#3
The proposed UUIDv8 implementation (section 4.5.4) contains a couple of typos:

- `time_or_seq`
- `seq_or_node` contains only 8 bits, not 12

Just as general feedback: UUIDv8 feels oddly specific to me when it should be the most flexible. Why must `timestamp_48` be set to 0 for 32-bit timestamps? These bits could be used for a sequence counter instead. It is also odd that timestamp and sequence counter are required, while the node according to the spec is optional and can be filled with more timestamp and sequence counter data. I would have expected that UUIDv8 would just contain (in this order) timestamp, sequence counter and node, with each having a variable length greater than zero and the version and variant bits being fixed. I don't really understand why it needs to be more specific than that.
Version 6 changes how the `node` and `clockseq` fields behave. This is confusing, and also compromises one of the most important traits version 6 can have: symmetric encoding with version 1.

It's confusing because these fields behave in ways that are contradictory and overlapping between the two versions:

- `node` is stable and (mostly) unchanging in version 1. In version 6 it is generated randomly each time.
- `clockseq` is also (mostly) stable in version 1, but must be set randomly if `node` changes. In version 6 it's something completely different: a subsecond counter (which version 1 already allows for in low-order bits of the timestamp).

Of more concern, however, is the loss of the ability to deterministically encode a v6 id from any v1 id, and vice-versa. Imagine a distributed legacy system where different components of the system are being migrated from version 1 to version 6 over time. Not being able to deterministically encode ids means there must be a centralized authority that contains the mapping of each v1 id to each corresponding v6 id. This is fundamentally counter to the entire thesis on which UUIDs are based, and in all likelihood will be (if you'll pardon the expression) an enormous pain in the ass for developers, who very likely chose UUIDs to avoid the need for a central authority.
I believe the community as a whole will be much better served if the version 6 spec consists of nothing more than a description of how version 1 fields are to be laid out (... while ensuring version 1 and 6 ids may be symmetrically encoded.)
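To make the "symmetric encoding" point concrete, here is what a deterministic, lossless v1 <-> v6 conversion could look like if v6 simply carried the v1 `clock_seq` and `node` octets over unchanged (a sketch, assuming the v6 layout is the v1 layout with the timestamp chunks reordered):

```python
from uuid import UUID

def uuid1_to_uuid6(u: UUID) -> UUID:
    """Deterministic v1 -> v6 bit permutation: reorder the 60-bit
    Gregorian timestamp to most-significant-first; carry clock_seq and
    node over unchanged (the behavior being argued for above)."""
    time_low = u.int >> 96
    time_mid = (u.int >> 80) & 0xffff
    time_hi = (u.int >> 64) & 0x0fff        # strip the version nibble
    ts = (time_hi << 48) | (time_mid << 32) | time_low
    rest = u.int & ((1 << 64) - 1)          # variant + clock_seq + node
    return UUID(int=((ts >> 12) << 80) | (0x6 << 76)
                    | ((ts & 0xfff) << 64) | rest)

def uuid6_to_uuid1(u: UUID) -> UUID:
    """Inverse permutation: recover the original v1 UUID exactly."""
    ts = ((u.int >> 80) << 12) | ((u.int >> 64) & 0xfff)
    rest = u.int & ((1 << 64) - 1)
    time_low = ts & 0xffffffff
    time_mid = (ts >> 32) & 0xffff
    time_hi = ts >> 48
    return UUID(int=(time_low << 96) | (time_mid << 80)
                    | (0x1 << 76) | (time_hi << 64) | rest)
```

Round-tripping `uuid6_to_uuid1(uuid1_to_uuid6(u))` returns the original v1 value bit for bit; this is exactly the property that a mandatory random `node` destroys.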
It would be much better to take "ULID with sequence" as the basis for RFC: https://github.com/Sofya2003/ULID-with-sequence
Draft 02, Section 4. Format
Old:

The UUID format is 16 octets; some bits of the eight octet variant field specified below determine finer structure.

New:

The UUID format is 16 octets; the variant bits in conjunction with the version bits described in the next section determine finer structure.
Text was copied from RFC4122 which has an open errata. We should fix it vs open ourselves for an errata.
https://www.rfc-editor.org/errata/eid4975
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| unix_timestamp_100ns_u56 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| unix_timestamp_100ns_u56 | filling_u8 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| var | random_u61 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| random_u61 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
- `part1` (64 bit):
  - `unix_timestamp_100ns_u56`: Unix timestamp in 100 ns resolution, stripped down to 56 bits, BE byte order for sortability;
  - `filling_u8`: usually random (can be non-random to keep monotonicity).
- `part2` (64 bit):
  - `var`: 0b111 for interoperability with RFC 4122;
  - `random_u61`: must be random.

I preferred to use variant = 0b111; it leaves 125 bits for timestamp + random.

But the format can be easily adapted for variant = 0b10x, version = 0b0111 (122 effective bits) by shrinking `filling_u8`.
Timestamp length/resolution can also be adjusted during the standard review.
With the current proposal, the timestamp overflow will occur in 2198. Nothing bad should happen at this moment.
Timestamp reuse will occur at about 2250. After that, the UUID will still work, only the collision resistance will be slightly reduced (but still better than ULID and other proposals that have a large epoch). There may be minor issues (e.g. reusing old partitions in the DBMS if UUID is used as partitioning key), but I assume that larger UUIDs will be used on this day;
The proposed format is mostly ULID-like, but:
It is very simple
unix_timestamp_100ns_u56 = tv_sec*10000000 + tv_nsec/100
filling_u8 = random
part1 = unix_timestamp_100ns_u56<<8 | filling_u8
random_u61 = random
part2 = 0b111<<61 | random_u61
UUID = part1 <<64 | part2
unix_timestamp_100ns_u56 = tv_sec*10000000 + tv_nsec/100
// the following line is for illustration only. real code must contain correct unsigned comparison
// if (part1_prev - 256e6) < (unix_timestamp_100ns_u56<<8) <= part1_prev {
if part1_prev - (unix_timestamp_100ns_u56<<8) < (unsigned) 256e6 {
part1 = part1_prev + 1
} else {
filling_u8 = random
part1 = unix_timestamp_100ns_u56<<8 | filling_u8
}
part1_prev = part1
random_u61 = random
part2 = 0b111<<61 | random_u61
UUID = part1 <<64 | part2
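A runnable Python rendering of the monotonic pseudocode above (my own translation; the ~100 ms window comes from the 256e6 constant, i.e. 1e6 ticks of 100 ns times the 2^8 filling values):

```python
import os
import time

part1_prev = 0

def next_uuid() -> int:
    """ULID-like 128-bit value: 56-bit 100 ns timestamp + 8 filling bits,
    then 3 variant bits (0b111) + 61 random bits. Stays monotonic as long
    as the clock doesn't regress by more than ~100 ms."""
    global part1_prev
    ts = time.time_ns() // 100                    # unix_timestamp_100ns_u56
    candidate = (ts & ((1 << 56) - 1)) << 8
    if 0 <= part1_prev - candidate < 256_000_000:
        part1 = part1_prev + 1                    # same/regressed tick: bump
    else:
        filling_u8 = os.urandom(1)[0]
        part1 = candidate | filling_u8
    part1_prev = part1
    random_u61 = int.from_bytes(os.urandom(8), 'big') & ((1 << 61) - 1)
    part2 = (0b111 << 61) | random_u61
    return (part1 << 64) | part2
```

Because Python integers are arbitrary precision, the unsigned-wraparound comparison from the pseudocode is expressed here as an explicit `0 <= diff < 256e6` range check instead.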
This allows:
Draft 03, Section 5.2. Monotonicity and Counters
Old:

The general guidance is that applications should freeze the counter and wait for the timestamp to advance which ensures monotonicity is not broken.

New:

The general guidance is that applications SHOULD freeze the counter and wait for the timestamp to advance which ensures monotonicity is not broken.
should > SHOULD
We should avoid using the term "clock sequence" in the version 7 section, as its use there conflicts with how it's defined in versions 1 and 6.
[Edit: Just realized version 6 redefines the "clock sequence" behavior. See #52. I don't think that changes the basic argument here for just calling it "counter"]
Specifically, in versions 1 & 6, a clock sequence is randomly initialized and stable unless the `node` changes or the system clock regresses. Contrast that to version 7, where the clock sequence (counter) is initialized to zero at the start of each time interval and increments each time a new ID is generated in that interval.
Just call it what it is: "counter".
In section 4.3.3 (UUIDv6 Node Usage) of this specification, there is the following text:
UUIDv6 node bits SHOULD be set to a 48-bit random or pseudo-random number. UUIDv6 nodes SHOULD NOT utilize an IEEE 802 MAC address or the [RFC4122], Section 4.5 method of generating a random multicast IEEE 802 MAC address.
May I ask what the motivation is for this section that tells me that I "SHOULD NOT"?
I know that using MAC addresses isn't fully secure, but to fix that, RFC 4122 also has Section 4.5. This section says if you cannot use or don't want to use MAC addresses, you can set one bit to 1 (that is always 0 in normal MAC addresses) and generate some 47-bit integer for the remaining bits. However, section 4.3.3 also says I "SHOULD NOT" use Section 4.5.
I don't see any problem with accepting Section 4.5, because setting that unicast/multicast bit makes it more secure (and the only difference is that your generated id isn't 48 bits, but just 47 bits).
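For reference, the Section 4.5 mechanism boils down to 47 random bits plus a forced multicast bit; in Python it is just:

```python
import os

def random_node() -> int:
    """48-bit node per RFC 4122 Section 4.5: random bits with the
    multicast bit (least significant bit of the first octet) set to 1,
    so it can never collide with a real unicast MAC address."""
    return int.from_bytes(os.urandom(6), 'big') | (1 << 40)
```

This is the same approach Python's standard `uuid` module uses as its fallback when no hardware address is available.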
Continuation of separate #24 thread
Question:
Should we redefine UUID variant bits 111 (E/F) which are currently "Reserved for future definition." as per the original RFC 4122 Section 4.1.1 source?
Proposal:
Possible text changes depending on feedback from this issue and #24
As such the Draft 01 goes from the current three definitions:
Name | Version | Variant | Description |
---|---|---|---|
UUIDv6 | 0110 | 10x (8/9/A/B) | UUIDv1 with Re-ordered Gregorian timestamp, explicit start sequence counter, no MAC address |
UUIDv7 | 0111 | 10x (8/9/A/B) | 36-bit Unix epoch timestamp, variable subsecond encoding up to nanoseconds using floating point math and fractions (38 bits allocated to subsecond precision). |
UUIDv8 | 1000 | 10x (8/9/A/B) | Relaxed implementation, any timestamp goes, future proof the specification, 122 bits to do as you desire with general guidelines of timestamp, sequence, random in that order |
To the possible four definitions in Draft 02:
Name | Version | Variant | Description |
---|---|---|---|
UUIDv1ε | 0001 | 111 (E/F) | UUIDv1 with Re-ordered Gregorian timestamp, explicit start sequence counter, no MAC address. (What is UUIDv6 in draft 01) |
UUIDv6 | 0110 | 10x (8/9/A/B) | 36-bit Unix epoch timestamp, variable subsecond encoding up to nanoseconds using floating point math and binary fractions (38 bits allocated to subsecond precision). (What is UUIDv7 in draft 01) |
UUIDv6ε | 0110 | 111 (E/F) | UUIDv6 with 36-bit Unix epoch timestamp, variable subsecond encoding up to nanoseconds using integers to represent total number of subseconds (30 bits allocated to subsecond precision). (Did not exist in draft 01) |
UUIDv7 | 0111 | 10x (8/9/A/B) | Relaxed implementation, any timestamp goes, future proof the specification, 122 bits to do as you desire with general guidelines of timestamp, sequence, random in that order. (What was UUIDv8 in draft 01) |
UUIDv8 | 1000 | 10x (8/9/A/B) | Goes away in draft 02 as it is no longer required. |
Question: Too many or too much?
Current Draft-02 implementation:
Alternatives Proposed:
All things Monotonicity and Counters to assist with sort ordering, db index locality, and same Timestamp-tick collision avoidance during batch UUID creation.
Sub-Topics: Counter position, length, rollover handling and seeding.
Extension of the thread: #58 (comment)
Required Reading: #41 (comment) and Research efforts
DEL (named after ASCII) obviously isn't the best name, but how about reserving the UUID that has all 128 bits set to one, in addition to the Nil UUID? The primary use case of the Del UUID is as a sentinel when processing time-ordered UUID versions, though I don't believe the new RFC needs to specify the semantics of the reserved Nil and Del. For unordered UUIDs, the single Nil value may suffice, but now that the UUID spec gets ordered versions, it should be a good timing to reserve the other end.
At first, I thought it was a good idea to invalidate both ends (MIN and MAX) of UUIDv7 values for the purpose of a sentinel, but on second thought after reading the Nil UUID discussion in #58, I believe just reserving the Del UUID is much simpler.

This change is not supposed to break existing applications, because the 0b111 `var` has been reserved for future definitions by the original RFC.
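In code terms the proposal is simply the complement of Nil (the constant names here are illustrative):

```python
from uuid import UUID

NIL_UUID = UUID(int=0)               # already reserved by RFC 4122
DEL_UUID = UUID(int=(1 << 128) - 1)  # proposed: all 128 bits set to one

# Every time-ordered UUID sorts strictly between the two sentinels.
assert str(DEL_UUID) == 'ffffffff-ffff-ffff-ffff-ffffffffffff'
assert NIL_UUID < DEL_UUID
```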
Reported by Daniel Gredler [email protected] on the mailer.
"without a consulting a centralized resource" -> "without consulting a centralized resource"
I was wondering if it would make sense to provide some kind of reference implementations and test vectors, like every hash function specification does?
For example, for UUIDv6: every `timestamp`, `clock_seq`, and `node` input -> some exact output.
Then in Python a reference implementation could be something like this, making it easy to test with the provided test vectors:
def uuid6(timestamp: int = None, clock_seq: int = None, node: int = None) -> UUID:
...
And then of course you could use `timestamp`, `clock_seq`, and `node` values from the test vectors to check against your own implementations.
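A sketch of what such a vector-checkable reference implementation could look like for UUIDv6 (the layout follows the draft's description: top 48 timestamp bits, version, bottom 12 timestamp bits, variant + clock_seq, node; the example vector is one I worked out by hand, not an official one):

```python
from uuid import UUID

def uuid6(timestamp: int, clock_seq: int, node: int) -> UUID:
    """Build a UUIDv6 from a 60-bit Gregorian timestamp (100 ns ticks
    since 1582-10-15), a 14-bit clock sequence, and a 48-bit node."""
    high = (timestamp >> 12) & ((1 << 48) - 1)  # top 48 timestamp bits
    low = timestamp & 0xfff                     # bottom 12 timestamp bits
    u = high << 80
    u |= 0x6 << 76                              # version 6
    u |= low << 64
    u |= 0b10 << 62                             # RFC 4122 variant
    u |= (clock_seq & 0x3fff) << 48
    u |= node & ((1 << 48) - 1)
    return UUID(int=u)

# Hand-derived example vector (illustrative, not from the draft):
assert uuid6(0x1EC9414C232AB00, 0x33C8, 0x9F6BDECED846) == \
    UUID('1ec9414c-232a-6b00-b3c8-9f6bdeced846')
```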
What's the current state of UUID v6 draft?
I think that UUID v6 (or another standardized alternative for UUIDs ordered by time) would be very important and useful in various data storage systems like databases. However, there are still some "concerns and TODOs" listed in this repository, and the IETF draft will expire in less than a month (on August 27, 2020). There also don't seem to be any big changes to the UUID v6 draft for a few months.
Does this mean the draft will be abandoned? If so, will there be any alternative in the near future? What should developers do if they want to use UUID-like primary keys and identifiers in databases without losing performance due to the unordered bits of current UUID versions?
Also, in case the draft expires, what does this mean? Can it be resubmitted when it is ready?
Ugh. I brain-farted on what clock sequences were. I need to gather my thoughts on this. I'll submit a new issue if/when it's appropriate.
In the meantime just delete this one, please. Sorry for the bother.
This is already fixed in #14 but opening an issue to track until V02 is submitted to IETF since these are generating lots of feedback emails.
"4.5. UUIDv8 Layout and Bit Order":
Old:
UUIDv8 SHOULD only be utilized if an implementation cannot utilize
UUIDv1, UUIDv6, or UUIDv8. Some situations in which UUIDv8 usage
New:
UUIDv8 SHOULD only be utilized if an implementation cannot utilize
UUIDv1, UUIDv6, or UUIDv7. Some situations in which UUIDv8 usage
Thanks:
Ville Skyttä [email protected]
Blachly, James [email protected]
Stefano Zuccaro [email protected]
[Edit: 'did my math wrong. "90 bit" -> "86 bit" throughout]
Observation: Clock precision is not the same as clock accuracy. Even on systems that provide microsecond or nanosecond precision timestamps, the actual time (when compared to "true" USNO naval time) is unlikely to be accurate to better than a few tens of milliseconds. I.e., where UUIDs are concerned, subsecond timestamp information is arguably indistinguishable from a simple counter. This is evidenced by RFC4122's provision for low-precision system clocks:
A high resolution timestamp can be simulated by keeping a count of the number of UUIDs that have been generated with the same value of the system time, and using it to construct the low order bits of the timestamp.
Observation: In version 1, the main reason for having a distinction between node and clock_seq was that node was expected to be a MAC address. However, that's proven to be a poor design choice, and we all agree that a random node is now preferred. As long as there are some reasonably stable, random bits somewhere in the UUID, there's not really much reason for an explicit node field.
Observation: v1 clock_seq is initialized randomly, and incremented as needed to ensure uniqueness and monotonicity. It also has desirable behavior around system clock regressions.
With the above in mind, we can expand the clock_seq field to include all of the bits from node, as well as the bits needed for subsecond timestamp precision, giving us a UUID layout that looks something like this:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| unixts |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|unixts | clock_seq | ver | clock_seq |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|var | clock_seq | clock_seq |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| clock_seq |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The ver and var fields are as we've seen previously in RFC4122.
unixts is the 36-bit unsigned integer field proposed elsewhere: unix epoch, 1-second resolution. (Allows for dates from 1970 to the year 4147.)
clock_seq is [effectively] an 86-bit unsigned integer, encoded most-significant bits first, with the ver and var fields injected in their usual positions.
clock_seq behaves as follows:
- clock_seq bits are all initialized randomly, with the exception of the first (high-order) bit, which is set to 0.
- clock_seq may be re-used for successive uuids, with the following provisions (it's worth noting that this behavior is consistent with how clock_seq works in v1 uuids):
  - If unixts is unchanged (i.e. in the same clock cycle), clock_seq MUST be incremented by one (1).
  - If unixts has advanced (i.e. a new clock cycle) and clock_seq's high-order bit is zero, no change is needed.
  - If unixts has advanced (i.e. a new clock cycle) and clock_seq's high-order bit is one, reinitialize as described above.
  - If unixts regresses, reinitialize as described above.

(Note: The high-order bit logic prevents clock_seq from rolling over mid clock-cycle, which might otherwise break monotonicity.)
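These clock_seq rules can be sketched in a few lines of Python. This is my illustration of the proposal, not code from the thread; the class and method names are hypothetical, and an 86-bit field is assumed per the layout above.

```python
import secrets
import time

class ClockSeqGen:
    """Sketch of the proposed 86-bit clock_seq behavior (names hypothetical)."""
    BITS = 86

    def __init__(self):
        self.last_unixts = -1
        self._reinit()

    def _reinit(self):
        # Initialize randomly with the high-order bit forced to 0, so the
        # counter cannot roll over within a single clock cycle.
        self.clock_seq = secrets.randbits(self.BITS) & ~(1 << (self.BITS - 1))

    def next(self, unixts=None):
        if unixts is None:
            unixts = int(time.time())
        if unixts == self.last_unixts:
            self.clock_seq += 1        # same clock cycle: MUST increment
        elif unixts < self.last_unixts:
            self._reinit()             # clock regression: reinitialize
        elif self.clock_seq >> (self.BITS - 1):
            self._reinit()             # new cycle, high bit set: reinitialize
        # else: new cycle, high bit clear: clock_seq may be reused as-is
        self.last_unixts = unixts
        return unixts, self.clock_seq
```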
UUIDs are guaranteed to be monotonic and sort in time-creation order (maximizes DB locality)
The high-order bits of clock_seq serve roughly the same purpose as v1's node: they provide a uniqueness guarantee across different systems for uuids generated with the same timestamp. Which bits are considered "high-order" and "stable" vs "low-order" and "dynamic" depends on the uuid creation rate and on the definition of "stable". For example, this table shows the number of high-order bits that will remain "stable" for 1 day and 1 year at the given creation rates:
uuid creation rate | # of stable bits (1 day) | # of stable bits (1 year) |
---|---|---|
> 1/sec | 67 | 59 |
1,000 / sec (~1K) | 57 | 49 |
1,000,000 / sec | 47 | 39 |
1,000,000,000 / sec | 37 | 29 |
Premise: Despite the UUID name, "universally unique" values are only achievable with some sort of pre-arranged pattern that decides which systems will provide which values. Adding more bits and entropy to the random parts of a UUID reduces collision probability, but cannot eliminate it. And there is no such thing as "enough collision resistance for any use case anywhere". The use of the MAC address in UUIDv1 was an attempt to use this approach. The drawbacks with it were some security risk in exposing MAC addresses, and the fact that with virtual computing so commonplace these days MAC address uniqueness can no longer be guaranteed. But the core idea behind it is: if you want to guarantee uniqueness, you have to have some sort of global registry or rules system which ensures some part of the UUID will be different from any other system. I'll refer to this approach as "shared knowledge" for lack of a better term (i.e. the different machines on the network share the knowledge of the MAC address registry and depend on it for uniqueness guarantees).
Question/proposition: Should the draft/spec indicate some means of guaranteeing global uniqueness through shared knowledge? While the approach of using the MAC address is flawed, it does address a vital need and exposes a problem with the current proposal: we still don't have guaranteed universally unique identifiers.
Prior art: A number of "this MUST be unique everywhere" registry systems exist (the setting is different from UUID generation, but the core problem is the same). There are probably a lot more such systems than I'm aware of. But just to list a few:
I'm sure there are many more. But you see what's happening here? When people REALLY need uniqueness, they create a registry, because there isn't any other way to achieve uniqueness guarantees.
Possible approaches:
Thoughts? If we really want uniqueness, we need to confront the fact that there's only one way to guarantee it.
EDIT:
Looking at the language of RFC4122 again, it seems like they specifically wanted to avoid any central registry, but made an exception for MAC addresses:
One of the main reasons for using UUIDs is that no centralized
authority is required to administer them (although one format uses
IEEE 802 node identifiers, others do not). As a result, generation
on demand can be completely automated, and used for a variety of
purposes....
Indeed that's probably one of the main arguments against any such registry: It's inconvenient. And also, if we've learned anything from MAC addresses: If anything goes wrong with the system, using random data (which is generally fast, easy and pretty much always available) is better anyway, i.e. lower collision probability.
Some of the UUIDv7 implementations I've seen on GitHub have not implemented fractional seconds correctly. For instance, putting the number of milliseconds directly into the 12-bit subsec_a field, thereby using a 0-999 subset of the full 0-4095 range.
To prevent this, could the initial descriptions of the subsec_a and subsec_b fields (around L568 of the -02.txt draft) make it more explicit that these must use the full range, with an example of what not to do?
Currently this isn't spelled out (though discussed in issue #24). For instance:
Section 4.4.1. - UUIDv7 Timestamp Usage
Additional sub-second precision (millisecond, nanosecond,
microsecond, etc) MAY be provided for encoding and decoding in the
remaining bits in the layout. [note: but doesn't say how!]
Section 4.4.4.1. - UUIDv7 Encoding
All 12 bits of scenario subsec_a is fully dedicated to millisecond
information (msec). [note: it isn't clear what "fully dedicated" means here]
It's only once the reader gets down to L845 that the requirements are spelled out:
Section 4.4.4.2. UUIDv7 Decoding
Similarly as per Figure 2, the sub-second precision values lie within
subsec_a, subsec_b, and subsec_seq_node which are all interpreted as
sub-second information after skipping over the version (ver) and
(var) bits. These concatenated sub-second information bits are
interpreted in a way where most to least significant bits represent a
further division by two. This is the same normal place notation used
to express fractional numbers, except in binary. For example, in
decimal ".1" means one tenth, and ".01" means one hundredth. In this
subsec field, a 1 means one half, 01 means one quarter, 001 is one
eighth, etc. This scheme can work for any number of bits up to the
maximum available, and keeps the most significant data leftmost in
the bit sequence.
As an additional suggestion, it would be very helpful if the text could include some examples of uuids and their min/max implied timestamps, to serve as test cases in unit tests.
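As an illustration of the mistake described above, a correct implementation scales milliseconds onto the full 12-bit range rather than storing them raw. This is my sketch, with hypothetical helper names, not text from the draft:

```python
def subsec_a_correct(msec: int) -> int:
    """Scale 0..999 milliseconds onto the full 0..4095 range (fractional binary)."""
    return msec * 4096 // 1000

def subsec_a_wrong(msec: int) -> int:
    """What some implementations do: raw milliseconds, using only 0..999 of 0..4095."""
    return msec
```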
Already fixed in #14,
Old:
4.5.4. UUIDv6 Basic Creation Algorithm
New:
4.5.4. UUIDv8 Basic Creation Algorithm
Found by Stefano Zuccaro [email protected] on the mailer
Freeze the counter when it overflows, until the timestamp advances. Fatal run-time errors are not allowed. Freezing the counter may only result in a slight decrease in database performance.
Section 4.5.4 references time_or_node multiple times.
Examples:
10. Format by concatenating the 128-bits as: timestamp_32|timestamp_
48|version|time_or_node|variant|seq_or_node|node
6. Set 12-bit time_or_node to the 12 least significant bits from
the timestamp
Section 4.5 only describes time_or_seq, and every other section under Section 4.5 only references time_or_seq.
A quantum-mechanical TRNG or a CSPRNG is a must. That means secrets.randbits() for Python and crypto/rand for Golang.
And I don't expect a very long lifespan for 128 bit UUIDs. It is very likely that in 50 or 100 years, 256-bit identifiers (or something radically new) will be used.
160-bit identifiers would be great right now. By the way, 160-bit identifiers in Crockford's base32 would be the same length as 128-bit identifiers in the usual UUID string format. Therefore they are compatible.
Originally posted by @sergeyprokhorenko in #23 (comment)
Draft 03, Section 5.6. Unguessability
No change to current text
Add:
Advice on generating cryptographic-quality random numbers can be found in RFC4086
From RFC 4122 Errata: https://www.rfc-editor.org/errata/eid3641
In examples and layouts, lengthen the counter to 15 bits for millisecond precision. 12 bits is too small. Here is the proof: #40 (comment)
UUID1 and UUID6 can be used with 128-bit signed integers, but UUID7 cannot.
UUID1 and UUID6 use 60 bits for the timestamp. If the timestamp is unsigned, the maximum possible year is around 5235 AD (2^60/10^7/60/60/24/365.25 + 1582 = ~5235). But RFC 4122 says that "since a UUID is a fixed size and contains a time field, it is possible for values to rollover around A.D. 3400, depending on the specific algorithm used". This indicates that the timestamp in RFC 4122 is signed (2^59/10^7/60/60/24/365.25 + 1582 = ~3408).
UUID7 timestamp is a 36 bit unsigned integer. It can contain dates up to 4147 AD (2^36/60/60/24/365.25 + 1970 = ~4147). If a UUID7 is stored in a signed 128-bit integer, it could be sorted incorrectly after 3059 AD (2^35/60/60/24/365.25 + 1970 = ~3059). I think this could be a problem, at least hypothetically.
Some languages and databases have signed and unsigned integer primitives:
But some languages have the signed primitive only:
Rust language has both signed and unsigned 128-bit integers:
There's also an old discussion about 128 primitive in Go: golang/go#9455
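The sorting problem can be demonstrated without a 128-bit primitive by reinterpreting the unsigned value as two's complement. This is my sketch (helper names hypothetical): a UUID7-style value whose 36-bit timestamp has its top bit set (i.e. after ~3059 AD) becomes negative as a signed 128-bit integer and sorts before earlier values.

```python
def as_i128(u: int) -> int:
    """Reinterpret an unsigned 128-bit value as a two's-complement signed one."""
    return u - (1 << 128) if u >> 127 else u

def uuid7_prefix(unixts: int) -> int:
    """Place the 36-bit unixts field in the top bits of a 128-bit value."""
    return (unixts & ((1 << 36) - 1)) << 92

early = uuid7_prefix(2**35 - 1)  # one second before the signed rollover (~3059 AD)
late = uuid7_prefix(2**35)       # one second after

assert early < late                    # unsigned comparison: correct order
assert as_i128(late) < as_i128(early)  # signed comparison: order reversed
```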
https://datatracker.ietf.org/doc/html/draft-peabody-dispatch-new-uuid-format-02#section-4.4.3
The UUIDv7 Node MAY
contain any set of data an implementation desires however the node
MUST NOT be set to all 0s which does not ensure global uniqueness.
In most scenarios the node SHOULD be filled with pseudo-random data.
Is that necessary? It creates a special case in pretty much every programming language.
For example in Python you have to create a node with something like the following.
node = 0
while node == 0:
    node = secrets.randbits(56)
Instead of just writing.
node = secrets.randbits(56)
Hello,
Thank you for your work on trying to shepherd this draft through the standards track. I help maintain one of the Go UUID libraries, and I'm noodling on releasing a version that implements the draft so folks can begin playing with it and offering feedback.
I was starting to target revision 1, and noticed there are some changes pending in the revision 2 WIP draft that would make me need to substantially refactor how I'm generating the UUID, specifically the change around how the clock sequence will be handled. No sweat, since I knew I'd be working with things that are subject to change. That said, I wanted to see if there was more context behind the reason for that change, knowing that the V6 UUID is supposed to be very similar to the V1.
Ultimately it made me realize that my V6 generator would share even less code with my V1 generator, since the clock sequence would need to be generated differently. So I started to wonder if there is even value in having the V6 UUID in its current form at all, if we're going to make it even more different than V1, and whether we should just make the V7 the V6 and guide everyone towards that richer UUID value.
I'd love to hear the context behind the change, and your thoughts on whether the change devalues V6 UUID's goal of being similar to the V1 format.
The exhausting attempts to squeeze such a heterogeneous functionality into 128 bits gave me the seditious thought of JSONB format as a container for UUID. See https://postgrespro.com/docs/postgresql/13/datatype-json for JSONB format. It is binary and indexable.
Does anyone really need three new versions of the UUID format?
Hello,
I understand that this may be beyond the scope of what you intended with your draft, but since it's slated to update 4122, and it has an impact on UUIDv6, I thought I'd drop this here in case you wanted some well-intentioned scope creep. 😄
While going through and manually revalidating my implementation of UUIDv6, I realized the library I maintained was generating UUIDv1 clock sequences subtly different than how it is defined in RFC 4122. This library was forked from an earlier one, and so this implementation detail of my library originated there. Here is what section 4.1.5 of RFC 4122 says:
4.1.5. Clock Sequence
For UUID version 1, the clock sequence is used to help avoid
duplicates that could arise when the clock is set backwards in time
or if the node ID changes.
If the clock is set backwards, or might have been set backwards
(e.g., while the system was powered off), and the UUID generator can
not be sure that no UUIDs were generated with timestamps larger than
the value to which the clock was set, then the clock sequence has to
be changed. If the previous value of the clock sequence is known, it
can just be incremented; otherwise it should be set to a random or
high-quality pseudo-random value.
This says the clock sequence is used to help avoid duplicates that could arise when the clock is set backwards in time. What this means is that the number of UUIDs a stateful generator can generate is limited by the precision of the timestamp, whereas a stateless generator using a random clock sequence does not face the same limitations. That seems like an unintentional side effect of the wording of the RFC. Especially when the RFC says you should either fail to generate the UUID or block until it's safe to do so:
4.2.1.2. System Clock Resolution
The timestamp is generated from the system time, whose resolution may
be less than the resolution of the UUID timestamp.
...
If a system overruns the generator by requesting too many UUIDs
within a single system time interval, the UUID service MUST either
return an error, or stall the UUID generator until the system clock
catches up.
So the person who originally implemented the UUIDv1 generator I now maintain, did more than the RFC said to do to avoid this failure mode and also incremented the clock sequence whenever a UUIDv1 has already been generated for this timestamp. This effectively removed the limitation on the number of UUIDv1s the stateful generator could generate, albeit while technically violating the language of the RFC.
if timeNow <= g.lastTime {
    g.clockSequence++
}
After I discovered this, I became curious about other UUIDv1 implementations in Go, and other languages, and I came to find that every stateful UUIDv1 implementation I looked at chose to not follow this section of the RFC as it was written. Instead, they either implemented it the same as the code I maintain, or they took the simpler approach of incrementing the sequence on every generation. The C implementation provided in RFC 4122 was implemented as the RFC itself was written, but that was the only stateful one I found doing so.
So that makes me wonder if we should use this opportunity to update the clock sequence generation language from 4122, so that UUIDv6 stateful generator implementers aren't also encouraged to violate the RFC as it's currently written.
I'd be interested in your thoughts.
Draft 02, Section 4.1. Variant and Version Fields
My suggested edits to make this a bit more clear:
Change UUIDv7/UUIDv8/Version 7/Version 8 to UUIDv7ε/UUIDv8ε/Version 7ε/Version 8ε
Changes from #26 (comment)
One of the benefits of the sub-second fraction representation is that we no longer need to stick to decimal clock resolutions like msec, usec, and nsec. We can employ some exotic precision resolutions such as ~0.25 milliseconds (1 / 4096 seconds) and ~5 microseconds. Please take a look at the following quick C code that prints the leading 48 bits of a UUIDv7:
#include <stdint.h>
#include <stdio.h>
#include <time.h>
int main() {
    struct timespec tp;
    clock_gettime(CLOCK_REALTIME, &tp);
    uint64_t timestamp = (uint64_t)tp.tv_sec << 12;
    // compute 12-bit (~0.25 msec precision) fraction from nsecs
    timestamp |= ((uint64_t)tp.tv_nsec << 12) / 1000000000;
    // cast for portability: uint64_t is not guaranteed to be unsigned long long
    printf("%08llx-%04llx\n",
           (unsigned long long)(timestamp >> 16),
           (unsigned long long)(timestamp & 0xFFFF));
    return 0;
}
The above code truncates microsecond-precision (in my environment) sub-seconds to ~0.25-millisecond precision in order to fully utilize the 12-bit subsec_a space. With the sub-second fraction representation, truncating a decimal timestamp to a convenient size is always a good option to pack the timestamp into a binary field efficiently.
I think the RFC should include this approach in one way or another because, though some implementers might invent similar techniques, many might naively implement the microsecond-precision example layout as in the draft 02 and just waste ~4 bits. Also, this approach might help resolve the endless debate over the length of sub-second fields like #24.
There's a lot to like about Draft 2, version 7 UUIDs. They solve many of the main issues people have raised in version 1. I especially appreciate the fractional binary format for subsecond values. However there are two issues that continue to nag at me.
Problem 1: The layout of time/counter bits vs. random bits is ambiguous. For example, if you examine the 3 example layouts (Draft 2, §4.4.4.1), you'll note that the high-order bits of the subsec_seq_node field may be used for subsecond time and seq information (Fig 4 & 5), or simply set randomly (Fig 3). Unfortunately there's no way of distinguishing which is which. This may prove problematic where such IDs become commingled and need to be sorted or sharded within a database.
Problem 2: Dedicating a fixed number of bits to monotonically increasing state (the subsec_a and subsec_b fields) leads to endless debate about whether the size of those fields is insufficient, sufficient, or excessive. Use cases that require low rates of UUID creation result in wasted bits (bits that rarely change), while cases with high creation rates result in possible counter overflow issues. We have seen arguments from both sides of this problem. (I suspect most of us have argued both sides of this problem!)
Proposal:
[Note: This is highly derivative of the current Draft 2 proposal and incorporates ideas that have been floated in other comments elsewhere. Please forgive any verbiage that may sound like I'm claiming credit for any of this as being particularly original.]
There are two parts to this proposal...
Part 1: This is more a matter of semantics, but I believe being explicit about this will simplify both the spec, and the discourse we're having around it. Basically I'd like to take §4.2.1.2 of the original RFC where it talks about how a counter may be used to populate the low-order bits of the timestamp and generalize it thusly:
"Subsecond time" refers to a combination of two pieces of state. The high order bits MUST be obtained from the subsecond state provided by the system clock (expected to be at least millisecond precision). Any remaining low order bits MUST be obtained from a count of the number of IDs generated in the current system clock time interval. The distribution of bits between system clock vs UUID counter will be implementation dependent.
(Where exactly this goes in the spec is TBD, but maybe replace section 4.4.2 of Draft 2...?)
The idea here is that by explicitly incorporating the notion of a "counter" into "subsecond time", we can do away with debate about how many bits are / aren't allocated to seq or counter fields and, instead, focus on the rate at which IDs may be generated, which is, after all, what people really care about.
However, in order to do that we need the second part of this proposal...
Part 2: tl;dr: Define a handful of "subsecond time" precisions and layouts that allow for a reasonable range of creation rates, and identify these layouts using a [new] pr field. The pr field can be small (two bits), and will allow implementors to select a layout most appropriate for their use case, thereby maximizing how the remaining bits are distributed between subsecond time and random entropy. Specifically, if pr is a two-bit field, we can use it to define the following layouts:
pr | subsecond field size (max ids per second) | Bit allocation examples (system clock : counter) |
---|---|---|
0b00 | 10 bits (1024) | 10 : 0 (msec clock, no counter); 0 : 10 (1-second clock, 10-bit counter) |
0b01 | 20 bits (1 × 10^6) | 10 : 10 (msec clock, 10-bit counter); 20 : 0 (usec clock, no counter) |
0b10 | 30 bits (1.1 × 10^9) | 10 : 20 (msec clock, 20-bit counter); 20 : 10 (usec clock, 10-bit counter) |
0b11 | 40 bits (1.1 × 10^12) | 20 : 20 (usec clock, 20-bit counter); 30 : 10 (nsec clock, 10-bit counter) |
Location of the pr field is TBD (and I'm not sure how important this decision is), but immediately following the var field seems like an obvious choice. This would allow for the following layouts:
Layout 0b00 (10-bit subsec field)
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| unixts |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|unixts | subsec |ran| ver | ...dom |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|var|0 0| rand |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| rand |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Layout 0b01 (20-bit subsec field)
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| unixts |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|unixts | subsec | ver | subsec | rand |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|var|0 1| rand |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| rand |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Layout 0b10 (30-bit subsec field)
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| unixts |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|unixts | subsec | ver | subsec |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|var|1 0|subsec | rand | rand |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| rand |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Layout 0b11 (40-bit subsec field)
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| unixts |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|unixts | subsec | ver | subsec |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|var|1 1| subsec | rand |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| rand |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
[NOTE: the subsec field here uses the fractional-binary format as described in Draft 2]
Draft 2 Version 7 fixes the number of dedicated subsecond time bits at 24 (subsec_a and subsec_b). This allows for a generation rate of ~4M uuids/second. That's pretty much it. Creating IDs faster than that requires setting state in the subsec_seq_node field, which becomes meaningless as soon as the ids in question are commingled with IDs that may have been generated with other layouts. And in cases where IDs are created slowly (< 1/msec), 12 bits go unused, resulting in only 62 bits of entropy to help avoid collisions.
Contrast this with the proposal here, where the fast case allows for ~1 trillion uuids/second (pr = 0b11) with 46 bits of entropy, and the slow case (pr = 0b00) allows for ~1000 uuids/second and 74 bits of entropy. It also has the added benefit of allowing recipients of IDs to identify exactly which bits are and are not monotonic in nature.
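A rough sketch of how an implementation might pack the proposed subsec field from system-clock and counter state, using the example bit splits from the table. The helper name and signature are hypothetical, my own illustration of the idea:

```python
def subsec_field(clock_bits: int, counter_bits: int,
                 frac: float, count: int) -> int:
    """Pack a subsec field: fractional-binary clock bits, then counter bits."""
    clock_part = int(frac * (1 << clock_bits))        # truncate fraction to n bits
    counter_part = count & ((1 << counter_bits) - 1)  # ids generated this interval
    return (clock_part << counter_bits) | counter_part

# pr = 0b01 example split: 10 msec-precision clock bits : 10 counter bits
field = subsec_field(10, 10, 0.5, 3)  # half a second in, 4th id this interval
```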
Note: Using the term "proprietary format" here to refer to any uuid formats that are user-specific in a way that doesn't [yet] merit formal RFC specification. (This notion comes from the world of MIME types, where the MIME standard allows for non-standard types, given a suitable "tree prefix" on the type string.)
In #30, I've argued that the current version 8 spec is too vague to carry much meaning. It defines timestamp, node, and clockseq fields in the most general way possible, allowing for arbitrary bit lengths in all fields, but, in doing so, fails to meaningfully define any of these fields. Furthermore, the spec is prefaced with a variety of provisions about when version 8 should and should not be used, all of which scream, "ONLY USE THIS FOR EXPERIMENTAL OR PROPRIETARY FORMATS".
In short, I believe version 8 is at risk of falling afoul of the classic "wanting the best of both worlds, but getting the worst" design process. It's a spec that is overly strict for proprietary use cases, and overly vague for timestamp cases. Thus, I believe the right approach is to combine the existing versions 6-8 into a single, new, timestamp uuid version, as discussed in issue #30.
Then, for use cases that require a proprietary format, have a new version that is dedicated to that purpose, and that makes as few assumptions about the nature of the contained proprietary format as possible. In other words, the spec should boil down to the following:
- variant bits as defined in RFC4122
- version bits set to 1000 (version 8, or whatever version # ends up being used)
(version 8, or whatever version # ends up being used)Question: Factions of a second representation or specific number of milliseconds/microseconds/nanoseconds for subsecond precision encoding within UUIDv7's subsec_a
, subsec_b
, and subsec_seq_node
Current draft-02 Implementation as per @bradleypeabody (who wrote the UUIDv7 section)
Sources from email between Brad and me on the topic:
[UUIDv7 is] supposed to be a fraction as opposed to the specific number of milliseconds.
If you want to indicate .75 seconds (three quarters of a second), you would write in decimal, following the period, the digits "75", right?
Well you do the exact same thing in binary, except that instead of "7" meaning 7 tenths, plus "5" meaning 5 hundredths, you do:
11
which means 1 half, plus one quarter.
Or, put another way, in decimal the digits go, from left to right:
tenths
hundredths
thousandths
ten thousandths
etc.
In binary it goes
halves
quarters
eighths
sixteenths
etc.
To encode a fraction of a second s (expressed as a rational number in the range [0-1), e.g. 0.321 seconds) into a bit field of length n, calculate m as the maximum integer value representable with n bits plus one (aka 2^n, 1<<n, or "an int with the lower n bits set to 1, plus one, converted to a float64"). The formula is then simply s / (1/m); the remaining fractional part is truncated and the number encoded as an integer. (E.g. 0.321 seconds encoded with 12 bits would be 0.321 divided by (1/4096) = 1314.816, which truncated and expressed as an integer value gives 1314, or 0x522 in hex, or 010100100010 in binary.)
Taking a page from how floating point numbers are encoded, we could express fractions of a second as a bitfield divisor of whatever length a particular implementation uses. E.g. JS implementations which only have millisecond precision can reliably encode this information into 12 bits (probably 10 or 11 actually, but I digress). The first bit set to 1 would mean 0.5 seconds, two bits set to 1 and 1 would mean .75 seconds, a 0 and then 1 would be 0.25 seconds, etc.
One down side I see is that it requires floating point math (or some system that can represent fractional numbers) for encoding and decoding, which perhaps on some embedded systems may not necessarily be readily available - not sure how big a deal that concern is. It's also slightly more complicated than dealing with integer math, but I don't think to an extent that is problematic.
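The encode/decode formulas above can be sketched in a few lines (function names are mine, not from the draft):

```python
def encode_subsec(s: float, n: int) -> int:
    """Truncate a sub-second fraction s in [0, 1) into an n-bit field."""
    return int(s * (1 << n))  # s / (1/m) with m = 2^n, fractional part truncated

def decode_subsec(bits: int, n: int) -> float:
    """Recover the (truncated) fraction from an n-bit field."""
    return bits / (1 << n)
```

For example, encode_subsec(0.321, 12) yields 1314 (0x522), matching the worked example above.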
Alternative is specific number of milliseconds/microseconds/nanoseconds as integer.
Sources: https://en.wikipedia.org/wiki/Orders_of_magnitude_(time)
More Discussion on the alternative:
If the alternative was utilized in Draft 02, we would encode as such:
- subsec_a is fully dedicated to millisecond information, with a pad of two bits at the start since we only need 10. Easier to do this padding since the version bits come up right after.
- subsec_a and subsec_b are utilized for microsecond information, with 4 bits padded at the start since we only require 20. Easier to pad here because the variant bits are coming up.
- subsec_a, subsec_b, and 6 bits of subsec_seq_node are used for nanosecond information. The final 48 bits of subsec_seq_node are used for random/entropy/node data. No padding required for this scenario.

@kyzer-davis and @edo1 @sergeyprokhorenko @fabiolimace @nerg4l @broofa @nurked (hopefully I didn't miss anyone).
I typed up a summary of what I suggest should go into a new draft, along with the rationale.
https://github.com/uuid6/uuid6-ietf-draft/blob/master/LATEST.md
I tried to keep it as succinct as possible and had to make some decisions, I'm sure some people will disagree on various things.
It would be great if we could review the issues being discussed against that, see how close it is to addressing everyone's concerns, and try to start narrowing down what specifically should be in the next draft. The detailed discussion can stay on the individual issue threads to keep them focused, but if there are specific changes to the doc above that would help move us toward an exact list of items to address, definitely let me know.
Hi, I'm a ULID user and would like to know the progress of this UUID v6 draft.
The draft seems to have expired, so I guess a new draft needs to be submitted:
https://tools.ietf.org/html/draft-peabody-dispatch-new-uuid-format-00
Is my understanding correct? If so, could you share a plan for the coming actions? Thanks in advance! 😀
Saw discussion here around verbiage used in section 7. Need to update verbiage to be less confusing.
https://news.ycombinator.com/item?id=28088213
Old:
This machine ID MUST be placed in the UUID proceeding the timestamp
New:
This machine ID MUST be placed in the UUID after the timestamp
In the current draft, section 2.3 says that the random component of 64 bits may be:
(a) filled with random bits; or
(b) filled with application defined values for uniqueness.
I have some considerations about it:
(0) I would like to note that, to be consistent with the other RFC-4122 versions, the first 2 bits of the random component should be set to ONE and ZERO, respectively.
(1) For the item (a) I propose that, in the case that more than one UUID v6 is generated within the same timestamp, the random component is incremented by 1. This is the approach adopted by the ULID specification*.
(2) For item (b), I also propose recommending that the 14 bits (2^14 = 16384) used for the clock sequence in version 1 could be used to embed an application-defined value for uniqueness. The name stays the same, but the purpose changes, and the application decides what to put in these bits. I think this could make adoption easier for existing libraries that already implement a clock sequence. But in this case the clock sequence should not be incremented.
So the structure could be one of these two:

(1) In the LSB, 2 bits are fixed and the remaining 62 are random:

|-------MSB--------|-------LSB-------|
 00000000-0000-6000-8000-000000000000
|------------------|-----------------|
      timestamp           random

(2) In the LSB, 2 bits are fixed, 14 bits are a uniquifier, and the remaining 48 are random:

|-------MSB--------|-------LSB-------|
 00000000-0000-6000-8000-000000000000
|------------------|----|------------|
      timestamp     clockseq  random
0: any value
6: fixed version
8: fixed variant (2 first bits are ONE and ZERO)
(3) As a consequence of item (1), the multicast bit is no longer required, unlike in version 1. Is it necessary to mention that in the draft?
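For illustration, a generator following structure (2) might look like the sketch below. The Gregorian-epoch timestamp handling is the v1-style encoding reshuffled into v6 field order, and the function name is mine; neither is part of the proposal itself:

```python
import secrets
import time

# 100-ns intervals between the Gregorian epoch (1582-10-15) and the Unix epoch
GREGORIAN_OFFSET = 0x01B21DD213814000

def uuid6_with_clockseq(clock_seq=None):
    """Sketch of structure (2): the MSB half holds the 60-bit timestamp
    with version 6; the LSB half holds the '10' variant, a 14-bit
    application-defined clock sequence, and 48 random bits."""
    ts = time.time_ns() // 100 + GREGORIAN_OFFSET           # 60-bit timestamp
    msb = ((ts >> 12) << 16) | (0x6 << 12) | (ts & 0x0FFF)  # version 6 nibble
    seq = secrets.randbits(14) if clock_seq is None else clock_seq & 0x3FFF
    lsb = (0b10 << 62) | (seq << 48) | secrets.randbits(48) # variant '10'
    h = f"{(msb << 64) | lsb:032x}"
    return f"{h[:8]}-{h[8:12]}-{h[12:16]}-{h[16:20]}-{h[20:]}"
```

The version nibble lands at position 14 of the canonical string ('6') and the variant nibble at position 19 ('8', '9', 'a', or 'b'), matching the 00000000-0000-6000-8000-000000000000 template above.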
Footnote:
Treat UUID like a black box: