
commavq's Introduction

Source video: source_video.mp4
Compressed video: compressed_video.mp4
Future prediction: generated.mp4

A world model is a model that can predict the next state of the world given the observed previous states and actions.

World models are essential to training all kinds of intelligent agents, especially self-driving models.

commaVQ contains:

  • encoder/decoder models used to heavily compress driving scenes
  • a world model trained on 3,000,000 minutes of driving videos
  • a dataset of 100,000 minutes of compressed driving videos

Task

Lossless compression challenge: make me smaller! $500 challenge

Losslessly compress 5,000 minutes of driving video "tokens". Go to ./compression/ to get started.

Prize: awarded for the highest compression rate on 5,000 minutes of driving video (~915MB). Challenge ended July 1st, 2024, 11:59pm AoE.

Submit a single zip file containing the compressed data and a Python script to decompress it into its original form. Top solutions are listed on comma's official leaderboard.
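For reference, the leaderboard metric is presumably uncompressed size divided by compressed size, so the arithmetic behind a given rate is simple:

```python
# Back-of-the-envelope scoring: rate = uncompressed bytes / compressed bytes.
original_mb = 915            # ~915MB of driving video tokens
baseline_rate = 1.6          # lzma baseline from the leaderboard
compressed_mb = original_mb / baseline_rate
assert round(compressed_mb, 3) == 571.875  # the baseline submission is ~572MB
```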

Implementation                                 Compression rate
pkourouklidis (arithmetic coding with GPT)     2.6
anonymous (zpaq)                               2.3
rostislav (zpaq)                               2.3
anonymous (zpaq)                               2.2
anonymous (zpaq)                               2.2
0x41head (zpaq)                                2.2
tillinf (zpaq)                                 2.2
baseline (lzma)                                1.6

Overview

A VQ-VAE [1,2] was used to heavily compress each video frame into 128 "tokens" of 10 bits each. Each entry of the dataset is a "segment" of compressed driving video, i.e. 1 minute of frames at 20 FPS. Each file has shape 1200x8x16 (frames x rows x columns, with 8x16 = 128 tokens per frame) and is saved as int16.
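Concretely, a loaded segment can be sanity-checked like this (a zero array stands in here for a real .npy file from the dataset):

```python
import numpy as np

# Stand-in for np.load("<segment>.npy"): one minute of compressed video.
tokens = np.zeros((1200, 8, 16), dtype=np.int16)

frames, rows, cols = tokens.shape
assert frames == 20 * 60              # 1 minute at 20 FPS
assert rows * cols == 128             # 128 tokens per frame
assert tokens.itemsize == 2           # saved as int16
assert 0 <= tokens.min() and tokens.max() < 2 ** 10  # 10-bit codebook indices
```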

A world model [3] was trained to predict the next token given a context of past tokens. This world model is a Generative Pre-trained Transformer (GPT) [4] trained on 3,000,000 minutes of driving videos following a similar recipe to [5].
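To make the next-token setup concrete, here is a toy autoregressive decoding loop. `dummy_model` (random logits) is a stand-in, not the released GPT; only the sampling structure is meant to mirror how future frames are imagined token by token:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 1024  # 10-bit token vocabulary

def dummy_model(context):
    """Stand-in for the world model: returns logits over the next token."""
    return rng.normal(size=VOCAB)

def generate_frame(context, tokens_per_frame=128):
    """Sample one future frame token by token, feeding each prediction
    back into the context (autoregressive decoding)."""
    context = list(context)
    for _ in range(tokens_per_frame):
        logits = dummy_model(context)
        probs = np.exp(logits - logits.max())  # softmax over the vocabulary
        probs /= probs.sum()
        context.append(int(rng.choice(VOCAB, p=probs)))
    return np.array(context[-tokens_per_frame:], dtype=np.int16)

frame = generate_frame(np.zeros(128, dtype=np.int16))
assert frame.shape == (128,) and 0 <= frame.min() and frame.max() < VOCAB
```

The real model additionally uses special tokens (e.g. a BOS token per frame, see the gpt.ipynb notebook), which this sketch omits.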

Examples

See ./notebooks/encode.ipynb and ./notebooks/decode.ipynb for an example of how to visualize the dataset using a segment of driving video from comma's drive to Taco Bell.

See ./notebooks/gpt.ipynb for an example of how to use the world model to imagine future frames.

See ./compression/compress.py for an example of how to compress the tokens using lzma.
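A minimal version of the lzma baseline idea, compressing the raw int16 token bytes and verifying the lossless round trip (the real compress.py may differ in details):

```python
import lzma

import numpy as np

# Toy stand-in for one segment: random 10-bit tokens stored as int16.
tokens = np.random.default_rng(0).integers(0, 1024, size=(1200, 8, 16), dtype=np.int16)

raw = tokens.tobytes()
compressed = lzma.compress(raw, preset=9)
rate = len(raw) / len(compressed)

# Decompression must reproduce the tokens exactly (lossless).
restored = np.frombuffer(lzma.decompress(compressed), dtype=np.int16).reshape(tokens.shape)
assert np.array_equal(restored, tokens)
assert rate > 1.0  # the zeroed high bits alone make int16 tokens compressible
```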

Download the dataset

  • Using huggingface datasets
import numpy as np
from datasets import load_dataset
num_proc = 40 # CPUs go brrrr
ds = load_dataset('commaai/commavq', num_proc=num_proc)
tokens = np.load(ds['0'][0]['path']) # first segment from the first data shard

References

[1] Van Den Oord, Aaron, and Oriol Vinyals. "Neural discrete representation learning." Advances in neural information processing systems 30 (2017).

[2] Esser, Patrick, Robin Rombach, and Bjorn Ommer. "Taming transformers for high-resolution image synthesis." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021.

[3] https://worldmodels.github.io/

[4] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).

[5] Micheli, Vincent, Eloi Alonso, and François Fleuret. "Transformers are Sample-Efficient World Models." The Eleventh International Conference on Learning Representations (ICLR). 2023.

commavq's People

Contributors

adeebshihadeh, elithecoder, grekiki2, hcnguyen111, incognitojam, yassineyousfi


commavq's Issues

Inference challenge

Is this still active? I made some progress (a 30-50% speed improvement with the same model), but it is still pretty far from 2x, so let me know if it is still relevant.

"No such file or directory" when using HuggingFace dataset downloader

On running the code sample from the README, or compression/compress.py, Python throws FileNotFoundError: [Errno 2] No such file or directory: '3b41c0fa8959aea6c118e5714f412a2e_13.npy' (the files are not downloaded to the expected location). I did check ~/.cache/huggingface/datasets/commaai___commavq and found some .arrow files, but the names do not match those specified by the dataset. Some help troubleshooting this would be much appreciated. Thanks!

InvalidProtobuf

Error description:
InvalidProtobuf: [ONNXRuntimeError] : 7 : INVALID_PROTOBUF : Load model from /home/mojo/dev/commavq-challenge/commavq/models/decoder.onnx failed:Protobuf parsing failed.

This happens when I try to load the decoder model.
When I try the checker

onnx.checker.check_model('../models/decoder.onnx')

I get the following error:
ValidationError: Unable to parse proto from file: /home/mojo/dev/commavq-challenge/commavq/models/decoder.onnx. Please check if it is a valid protobuf file of proto.

I am not sure whether the ONNX file is broken or this is an ONNX problem.

I tried onnxruntime-gpu and onnx 1.14 and 1.15, both times the same problem.

Anyone else experiencing the same problems?

Could this be the problem?
https://github.com/onnx/onnx/blob/09a4e65bb098164491b021ffe563a559fbc1a808/docs/ExternalData.md

I have a bug in gpt.ipynb. Can you help me?

File "notebooks/gpt.ipynb": y = model.generate(idx, config.tokens_per_frame)
RuntimeError: Index put requires the source and destination dtypes match, got Half for the destination and BFloat16 for the source.
I don't know what the problem is; I guess it might be an environment issue. What is the runtime environment for your code? I also found that I couldn't load gpt2m @ 12f0a5e. What is this?

How is the pose data used in conditioning during inference?

The pose data are given as 6 real-valued numbers per frame. How are these values used to condition the model?

The talk mentions using np.digitize to tokenize the poses, which would allow a subsequent pass through the embedding layer. But what are the bin values?

And are the pose values prepended before the BOS token during inference for every frame we need to condition? What does this mean for the maximum context length of the model? How would we condition it again after doing so once at the start?
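For reference, the np.digitize step mentioned above would look roughly like this; the bin edges here are made up for illustration, since the actual ones used in training are not published:

```python
import numpy as np

# Hypothetical uniform bins; the real edges are unknown.
bins = np.linspace(-1.0, 1.0, num=255)  # 255 edges -> 256 buckets (ids 0..255)
pose = np.array([0.01, -0.2, 0.0, 0.5, -0.9, 0.03])  # one frame's 6-DOF values
pose_tokens = np.digitize(pose, bins)   # integer ids, ready for an embedding layer
assert pose_tokens.shape == (6,)
assert pose_tokens.min() >= 0 and pose_tokens.max() <= 255
```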

Can't Download Pretrained Models

Command run:

git lfs fetch --all

Result:

fetch: 9 object(s) found, done.
fetch: Fetching all references...
batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.
error: failed to fetch some objects from 'https://github.com/commaai/commavq.git/info/lfs'

Maybe put them on academic torrents?

Request for 6-DOF Pose Information for Each Frame

I kindly request the project's author to provide 6-DOF pose information for each frame. The current implementation is useful, but I also need this additional data.

Please consider adding an option or API that allows accessing the 6-DOF pose information for every frame. This would greatly enhance the project's capabilities and benefit users with similar needs.

Thank you for considering this feature request.

Access to training code, provide incentives

This approach seems very interesting for other use cases.
For example, I would like to VQ videos of sign language rather than driving.
If the training code were available (including fast video decoding, data augmentation, and the training strategy), one could replicate your models in a different domain.

Why would you care?
If I can replicate this in my domain, I have a strong incentive to make this repo more performant.

[ONNXRuntimeError] : 7 : INVALID_PROTOBUF : while loading encoder.onnx

The protobuf version

(base) a@t4:~$ protoc --version
libprotoc 3.12.4

ONNX runtime

In [1]: import onnxruntime as ort

In [2]: ort.__version__
Out[2]: '1.16.0'

Python version

(base) a@t4:~$ python --version
Python 3.10.12

Traceback

In [6]: options = ort.SessionOptions()
   ...: provider = 'CUDAExecutionProvider'
   ...: session = ort.InferenceSession('/home/a/commavq/gpt2m/encoder.onnx', options, [provider])
/opt/conda/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py:69: UserWarning: Specified provider 'CUDAExecutionProvider' is not in available provider names.Available providers: 'AzureExecutionProvider, CPUExecutionProvider'
  warnings.warn(
---------------------------------------------------------------------------
InvalidProtobuf                           Traceback (most recent call last)
Cell In[6], line 3
      1 options = ort.SessionOptions()
      2 provider = 'CUDAExecutionProvider'
----> 3 session = ort.InferenceSession('/home/a/commavq/gpt2m/encoder.onnx', options, [provider])

File /opt/conda/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py:419, in InferenceSession.__init__(self, path_or_bytes, sess_options, providers, provider_options, **kwargs)
    416 disabled_optimizers = kwargs["disabled_optimizers"] if "disabled_optimizers" in kwargs else None
    418 try:
--> 419     self._create_inference_session(providers, provider_options, disabled_optimizers)
    420 except (ValueError, RuntimeError) as e:
    421     if self._enable_fallback:

File /opt/conda/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py:460, in InferenceSession._create_inference_session(self, providers, provider_options, disabled_optimizers)
    458 session_options = self._sess_options if self._sess_options else C.get_default_session_options()
    459 if self._model_path:
--> 460     sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
    461 else:
    462     sess = C.InferenceSession(session_options, self._model_bytes, False, self._read_config_from_model)

InvalidProtobuf: [ONNXRuntimeError] : 7 : INVALID_PROTOBUF : Load model from /home/a/commavq/gpt2m/encoder.onnx failed:Protobuf parsing failed.

eval.ipynb

"model" is not defined

on line 22
# your model here!
pred = model(x)

Inference Bounty

Hey @YassineYousfi!

Just learnt about this from your talk being (re)released, very cool!

Spent some time on the inference bounty, getting 0.2s per frame on a 4090 and 0.28s on a 3090. There may be more I can squeeze out, but I'm not certain.

Some quick questions:

  • Is the bounty still open?
  • Does a pure PyTorch solution with some startup time due to compilation fit your criteria?
  • If yes to the above two, what's the standard process re: bounties?
