timeflake's Introduction

Timeflake

Timeflake is a 128-bit, roughly-ordered, URL-safe UUID. Inspired by Twitter's Snowflake, Instagram's ID and Firebase's PushID.

Features

  • Fast. Roughly ordered (K-sortable): an incrementing timestamp in the most significant bits enables faster indexing and less fragmentation on database indices (vs UUID v1/v4).
  • Unique enough. With 1.2e+24 (2^80) possible random values per millisecond, even if you create 50 million timeflakes within the same millisecond the chance of a collision is still about 1 in a billion. A collision only becomes likely (roughly 50%) when creating about 1.3e+12 (one trillion three hundred billion) timeflakes per millisecond.*
  • Efficient. 128 bits are used to encode a timestamp in milliseconds (48 bits) and a cryptographically generated random number (80 bits).
  • Flexible. Out of the box encodings in 128-bit unsigned int, hex, URL-safe base62 and raw bytes. Fully compatible with uuid.UUID.

* Please consider how the Birthday Paradox might affect your use case. Also read the security note on this readme.
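
The collision figures above can be reproduced with the usual birthday-problem approximation. The snippet below is a rough sketch using only the standard library; the function names are made up for illustration and are not part of timeflake.

```python
import math

RANDOM_BITS = 80
SPACE = 2 ** RANDOM_BITS  # possible random components within one millisecond


def collision_probability(n: int, space: int = SPACE) -> float:
    """Approximate chance of at least one collision among n random draws."""
    # Birthday approximation: p ~= 1 - exp(-n * (n - 1) / (2 * space)).
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-n * (n - 1) / (2 * space))


def draws_for_probability(p: float, space: int = SPACE) -> float:
    """Approximate number of draws needed to reach collision probability p."""
    return math.sqrt(2 * space * math.log(1 / (1 - p)))


print(f"{SPACE:.1e}")                              # 1.2e+24
print(f"{collision_probability(50_000_000):.1e}")  # 1.0e-09
print(f"{draws_for_probability(0.5):.1e}")         # 1.3e+12
```

Note that these numbers apply per millisecond: IDs created in different milliseconds differ in their timestamp bits and cannot collide.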

Why?

This could be useful to you, if you're looking for a UUID with the following properties:

  • You want to have UUIDs in URLs that are not predictable (vs auto-increment integers).
  • They should be random, but roughly-ordered over time so that your MySQL/Postgres indices stay fast and efficient as the dataset grows.
  • And simple to use across multiple machines (no coordination or centralized system required).
  • It would be nice if they were compatible with standard 128-bit UUID representations (many libraries in Python handle uuid.UUID, but no third-party types).

Some existing alternatives which I considered:

  • UUIDv1 but the timestamp bytes are not sequential and it gives away network information.
  • UUIDv4 but they're mostly random, and can mess up the performance on clustered indexes.
  • ULID but the approach to incrementing the sequence during the same millisecond makes it more predictable.
  • KSUID but it's 160-bit, so unfortunately not compatible with standard 128-bit UUIDs.

Usage

import timeflake

# Create a random Timeflake
flake = timeflake.random()
>>> Timeflake(base62='00mx79Rjxvfgr8qat2CeQDs')

# Get the base62, int, hex or bytes representation
flake.base62
>>> '00mx79Rjxvfgr8qat2CeQDs'

flake.hex
>>> '016fa936bff0997a0a3c428548fee8c9'

flake.int
>>> 1909005012028578488143182045514754249

flake.bytes
>>> b'\x01o\xa96\xbf\xf0\x99z\n<B\x85H\xfe\xe8\xc9'

# Convert to the standard library's UUID type
flake.uuid
>>> UUID('016fa936-bff0-997a-0a3c-428548fee8c9')

# Get the timestamp component
flake.timestamp
>>> 1579091935216

# Get the random component
flake.random
>>> 724773312193627487660233

# Parse an existing flake (you can also pass bytes, hex or int representations)
timeflake.parse(from_base62='0002HCZffkHWhKPVdXxs0YH')
>>> Timeflake('0004fbc6872f70fc9e27355a499e8b6d')

# Create from a user defined timestamp or random value:
timeflake.from_values(1579091935216, 724773312193627487660233)
>>> Timeflake('016fa936bff0997a0a3c428548fee8c9')
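
Because a timeflake is just 128 bits, the mapping to and from uuid.UUID loses nothing. The following sketch shows this with the standard library only, reusing the hex value from the example above; it does not call timeflake itself.

```python
import uuid

hex_value = "016fa936bff0997a0a3c428548fee8c9"  # flake.hex from the example above

# Wrap the same 128 bits in the standard library's UUID type
as_uuid = uuid.UUID(hex=hex_value)
print(as_uuid)  # 016fa936-bff0-997a-0a3c-428548fee8c9

# Going back is lossless: hex, int and bytes all describe the same value
assert as_uuid.hex == hex_value
assert as_uuid.int == int(hex_value, 16)
assert uuid.UUID(bytes=as_uuid.bytes) == as_uuid
```

One thing to keep in mind: the bits that uuid.UUID interprets as version and variant are ordinary timestamp or random bits here, so they should not be relied on to carry any meaning.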

Spec

128 bits are used to encode:

  1. UNIX-time in milliseconds (48 bits)
  2. Cryptographically generated random number (80 bits)

For example, the timeflake 016fb4209023b444fd07590f81b7b0eb (hex) encodes the following:

016fb4209023  +  b444fd07590f81b7b0eb
      |                   |
      |                   |
  timestamp            random
  [48 bits]           [80 bits]

In Python:

flake = timeflake.parse(from_hex='016fb4209023b444fd07590f81b7b0eb')

flake.timestamp
>>> 1579275030563  # 2020-01-17T15:30:30.563 UTC

flake.random
>>> 851298578153087956398315
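
To make the layout concrete, here is a minimal standard-library sketch of packing and unpacking the two components with bit shifts. It is an illustration of the spec above, not the library's actual implementation; compose, decompose and new_id are hypothetical names.

```python
import secrets
import time
from typing import Tuple

TIMESTAMP_BITS = 48
RANDOM_BITS = 80


def compose(timestamp_ms: int, random_value: int) -> int:
    """Pack a millisecond timestamp and a random value into one 128-bit integer."""
    if not 0 <= timestamp_ms < 2 ** TIMESTAMP_BITS:
        raise ValueError("timestamp does not fit in 48 bits")
    if not 0 <= random_value < 2 ** RANDOM_BITS:
        raise ValueError("random value does not fit in 80 bits")
    # Timestamp goes in the most significant bits so later IDs compare higher
    return (timestamp_ms << RANDOM_BITS) | random_value


def decompose(value: int) -> Tuple[int, int]:
    """Split a 128-bit integer back into (timestamp_ms, random_value)."""
    return value >> RANDOM_BITS, value & (2 ** RANDOM_BITS - 1)


def new_id() -> int:
    """Build an ID from the current time and a cryptographically random value."""
    return compose(time.time_ns() // 1_000_000, secrets.randbits(RANDOM_BITS))


timestamp_ms, random_value = decompose(int("016fb4209023b444fd07590f81b7b0eb", 16))
print(timestamp_ms)             # 1579275030563
print(f"{random_value:020x}")   # b444fd07590f81b7b0eb

# Any ID from a later millisecond sorts after any ID from an earlier one
assert compose(1, 0) > compose(0, 2 ** RANDOM_BITS - 1)
```

Within the same millisecond the order is decided by the random part, which is why the IDs are only roughly ordered.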

Alphabets

A custom base62 alphabet representation is included, modified to preserve lexicographical order when sorting strings using this encoding. The hex representation has a max length of 32 characters, while the base62 will be 22 characters. Padding is required to be able to derive the encoding from the string length.

The following are all valid representations of the same Timeflake:

int    = 1909226360721144613344160656901255403
hex    = 016fb4209023b444fd07590f81b7b0eb
base62 = 02i2XhN7hAuaFh3MwztcMd

You can convert a timeflake to any alphabet using the itoa (integer to ASCII) function:

import timeflake
from timeflake.utils import itoa

flake = timeflake.random()
# Render the 128-bit integer in the hex alphabet, left-padded to 32 characters
itoa(flake.int, alphabet=timeflake.flake.HEX, padding=32)
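
For readers curious why padding and alphabet order matter, below is a small standalone sketch of an order-preserving base62 encoding. It assumes an alphabet in ASCII order (digits, then uppercase, then lowercase); that choice and the encode/decode helpers are assumptions made for this illustration, so consult the library source for the alphabet it actually uses.

```python
import string

# Assumption for this sketch: characters appear in ASCII order, so comparing
# encoded strings byte by byte matches comparing the underlying integers.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase


def encode(value: int, alphabet: str = ALPHABET, padding: int = 22) -> str:
    """Encode a non-negative integer, left-padded to a fixed width."""
    if value < 0:
        raise ValueError("value must be non-negative")
    base = len(alphabet)
    chars = []
    while value:
        value, remainder = divmod(value, base)
        chars.append(alphabet[remainder])
    return "".join(reversed(chars)).rjust(padding, alphabet[0])


def decode(text: str, alphabet: str = ALPHABET) -> int:
    """Decode a string produced by encode() back into an integer."""
    index = {char: position for position, char in enumerate(alphabet)}
    value = 0
    for char in text:
        value = value * len(alphabet) + index[char]
    return value


values = [0, 61, 62, 2 ** 80, 2 ** 128 - 1]
encoded = [encode(v) for v in values]

assert all(len(text) == 22 for text in encoded)    # fixed width thanks to padding
assert encoded == sorted(encoded)                  # string order matches numeric order
assert [decode(text) for text in encoded] == values
```

Without the fixed-width padding, a short string such as "z" would sort after a longer one such as "10" even though it encodes a smaller number.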

Provided extensions

Django model fields

You can use timeflakes as primary keys for your models. These fields currently support MySQL, Postgres and Sqlite3.

Example usage:

from django.db import models
from timeflake.extensions.django import TimeflakePrimaryKeyBinary

class Item(models.Model):
    item_id = TimeflakePrimaryKeyBinary()
    # ...

Peewee ORM

See this gist for an example.

Note on security

Since the timestamp part is predictable, the search space within any given millisecond is 2^80 random numbers, which is meant to avoid collisions, not to secure or hide information. You should not be using timeflakes for password-reset tokens, API keys or for anything which is security sensitive. There are better libraries which are meant for this use case (for example, the standard library's secrets module).
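
For those security-sensitive cases, a short sketch using the standard library's secrets module might look like this (the variable names are illustrative only):

```python
import secrets

# 32 bytes (256 bits) of randomness, with no timestamp or other structure
reset_token = secrets.token_urlsafe(32)  # URL-safe text
api_key = secrets.token_hex(32)          # 64 hexadecimal characters

# Compare secrets in constant time to avoid leaking information via timing
is_match = secrets.compare_digest(reset_token, reset_token)
```

Unlike a timeflake, none of the bits in these tokens can be guessed from the time they were created.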

Note on privacy

Please be aware of the privacy implications that time based IDs can have. As Timeflake encodes the precise time in which the ID was created, this could potentially reveal:

  • User timezone.
  • Geographic location: If the client software creates multiple associated IDs at the same time (like an article and embedded media), then the differences in timestamps of the IDs can reveal the latency of the client's network connection to the server. This reveals user geographic location. This can also happen if the client creates a single ID and the server adds an additional timestamp to the object.
  • User identity (de-anonymizing)
    1. Most Android apps include Google's libraries for working with push notifications. And some iOS apps that use Google Cloud services also load the libraries. These Google libraries automatically load Google Analytics which records the names of every screen the users view in the app, and sends them to Google. So Google knows that userN switched from screen "New Post" to screen "Published Post" at time K.
    2. Some ISPs record and sell user behavior data. For example, SAP knows that userN made a request to appM's API at time K.
    3. Even if the posting app does not share its user behavior data with third-parties, the user could post and then immediately switch to an app that does share user behavior data. This provides data points like "userN stopped using an app that does not record analytics at time K".
    4. Operating Systems (Android, Windows, macOS) send user behavior data to their respective companies.
    5. Browsers and Browser Extensions send user behavior data to many companies. Data points like "userN visited a URL at example.com at time K" can end up in many databases and be sold.
    6. Posting times combined with traffic analysis can perfectly de-anonymize users.
  • How long the user took to write the post. This can happen if the app creates the ID when the user starts editing the post and also shares a timestamp of the publication or save time.
  • Whether or not the user edited the post after posting it. This can happen if the post's displayed time doesn't match the timestamp in the ID.
  • Whether or not the user prepared the post in advance and set it to post automatically. If the timestamp is very close to a round numbered time like 21:00:00, it was likely posted automatically. If the posting platform does not provide such functionality, then the user must be using some third-party software or custom software to do it. This information can help de-anonymize the user.
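
To see how little effort this takes, here is a sketch of how anyone holding an ID could read its creation time with nothing but the standard library. The helper name created_at is hypothetical; the hex value is the one from the Spec section.

```python
from datetime import datetime, timedelta, timezone


def created_at(flake_hex: str) -> datetime:
    """Recover the creation time embedded in a 32-character hex timeflake."""
    timestamp_ms = int(flake_hex, 16) >> 80  # keep only the top 48 bits
    seconds, millis = divmod(timestamp_ms, 1000)
    # Integer arithmetic avoids floating point rounding of the milliseconds
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)


print(created_at("016fb4209023b444fd07590f81b7b0eb").isoformat())
# 2020-01-17T15:30:30.563000+00:00
```

No secret or library access is required, so any ID that appears in a URL or API response should be treated as publishing its creation time.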

Supported versions

Right now the codebase is only tested with Python 3.7+.

Dependencies

No dependencies other than the standard library.

Contribute

Want to hack on the project? Any kind of contribution is welcome! Simply follow the next steps:

  • Fork the project.
  • Create a new branch.
  • Make your changes and write tests when practical.
  • Commit your changes to the new branch.
  • Send a pull request; it will be reviewed shortly.
  • In case you want to add a feature, please create a new issue and briefly explain what the feature would consist of. For bugs or requests, before creating an issue please check if one has already been created for it.

Contributors

Thank you for making this project better!

  • @mleonhard - Documented privacy implications of time based IDs.
  • @making - Implemented Java version of Timeflake.
  • @Gioni06 - Implemented Go version of Timeflake.
  • @zzzz465 - Implemented TS/JS version of Timeflake.
  • @bibby - Added extension for peewee ORM.
  • @sebst - Improved compatibility with standard UUID class.
  • @yedpodtrzitko - Codebase improvements.

Changelog

Please see the CHANGELOG for more details.

Implementations in other languages

License

This project is licensed under the MIT license. Please read the LICENSE file for more details.

Prior art

timeflake's Issues

django-simple-history not compatible with TimeflakePrimaryKeyBinary

Let's say you have the following model and want to add history tracking to each FooModel object:

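import uuid

from django.db import models
from simple_history.models import HistoricalRecords
from timeflake.extensions.django import TimeflakePrimaryKeyBinary

class FooModel(models.Model):
    id = TimeflakePrimaryKeyBinary()
    history = HistoricalRecords(
        history_id_field=models.UUIDField(default=uuid.uuid4)
    )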

It generates the following error during makemigrations:
File "venv/lib/python3.9/site-packages/timeflake/extensions/django/__init__.py", line 97, in deconstruct
    del kwargs["primary_key"]
KeyError: 'primary_key'

However, I tried adding the following try/except around this code, and the migration was then generated without an issue:
[screenshot of the try/except change; image not available]

The downside is that this leads to the following error when trying to execute the migration:
django.db.utils.ProgrammingError: multiple primary keys for table "requisition_data_historicalrequisitionmodel" are not allowed
LINE 1: ...me" varchar(200) NULL, "history_id" uuid NOT NULL PRIMARY KE...

So at the end of the day, any model that uses TimeflakePrimaryKeyBinary for its primary key id field is unable to track object history using django-simple-history (https://django-simple-history.readthedocs.io/en/latest/), which is a deficit.

I can manually adjust the migration after the fact to "work" (i.e. be a CharField or UUIDField etc. with primary_key=False), but Django will not honor the change and all subsequent makemigrations commands will try to undo the changes.

Can this issue be investigated? Thanks!

Is there any spec documentation in the readme?

Something similar to how https://github.com/ulid/spec#specification would be appreciated.

I would also be interested in how the conversion from timeflake to UUID and back may work (e.g. what extra field is needed for lossless conversion from UUID back to timeflake).

Also, is there any NULL or MAX value available?

On a side note... I wonder why timeflake and ULID have fixed time precision and random ID length. Would it have made sense to allow for developers to adjust timestamp precision or random bit length (or even shrink it if they feel like it)?
If I'm speculating on what this may look like, this is what I would come up with.

<48b: Timestamp><Ext Timestamp>...<X Random bits><Ext Random>...

Extended Timestamp:
0bXXXX_XXX0 - Timestamp Extension Completed (X= Random Bits)
0bTTTT_TTT1 - Next Byte is extended sub-millisecond precision timestamp

Extended Randomness:
0b0XXX_XXXX - No Change
0b1RRR_RRRR - Next Byte is extended randomness

Changelog dates or Github releases?

I have a minor aversion to using projects that don't include dates in their changelog. I tend to follow the keepachangelog spec on my own projects. Is this something that could be followed in the future?

Point out the privacy risks of timestamps in object IDs

Non-random object IDs have privacy issues. Developers need to learn about these before they choose to use non-random IDs over simple random IDs. How about adding a PRIVACY section to the readme?

Timeflake encodes the precise time that a user created an object. Timestamps in IDs can reveal:

  • User timezone.
  • Geographic location: If the client software creates multiple associated IDs at the same time (like an article and embedded media), then the differences in timestamps of the IDs can reveal the latency of the client's network connection to the server. This reveals user geographic location. This can also happen if the client creates a single ID and the server adds an additional timestamp to the object.
  • User identity (de-anonymizing)
    • Most Android apps include Google's libraries for working with push notifications. And some iOS apps that use Google Cloud services also load the libraries. These Google libraries automatically load Google Analytics which records the names of every screen the users view in the app, and sends them to Google. So Google knows that userN switched from screen "New Post" to screen "Published Post" at time K.
    • Some ISPs record and sell user behavior data. For example, SAP knows that userN made a request to appM's API at time K.
    • Even if the posting app does not share its user behavior data with third-parties, the user could post and then immediately switch to an app that does share user behavior data. This provides data points like "userN stopped using an app that does not record analytics at time K".
    • Operating Systems (Android, Windows, macOS) send user behavior data to their respective companies.
    • Browsers and Browser Extensions send user behavior data to many companies. Data points like "userN visited a URL at example.com at time K" can end up in many databases and be sold.
    • Posting times combined with traffic analysis can perfectly de-anonymize users.
  • How long the user took to write the post. This can happen if the app creates the ID when the user starts editing the post and also shares a timestamp of the publication or save time.
  • Whether or not the user edited the post after posting it. This can happen if the post's displayed time doesn't match the timestamp in the ID.
  • Whether or not the user prepared the post in advance and set it to post automatically. If the timestamp is very close to a round numbered time like 21:00:00, it was likely posted automatically. If the posting platform does not provide such functionality, then the user must be using some third-party software or custom software to do it. This information can help de-anonymize the user.

Subclass TimeflakePrimaryKeyBinary from UUIDField

UUIDField is built into Django and should therefore be preferred, as it will likely increase compatibility with other third-party libraries such as DRF.

See previous issue #8.

However, subclassing now would likely result in migration-incompatible changes (as the db type may change and foreign key relations could point to this). This would justify a Major Version increase according to SemVer even if only the Django extension would be affected.

So, imho, there are two options:

  1. Do the subclass now and increase to 1.0.0
  2. Remove the django extension and make a new Python package like django-timeflake.

What do you think, @anthonynsimon?
