
Omnivore: A Single Model for Many Visual Modalities


[paper][website]

OMNIVORE is a single vision model for many different visual modalities. It learns to construct representations that are aligned across visual modalities, without requiring training data that specifies correspondences between those modalities. Using OMNIVORE's shared visual representation, we can identify the nearest neighbors of a query image (ImageNet-1K validation set) in datasets of other modalities: depth maps (ImageNet-1K training set), single-view 3D images (ImageNet-1K training set), and videos (Kinetics-400 validation set).

This repo contains the code to run inference with a pretrained model on an image, video or RGBD image.

Usage

Setup and Installation

Omnivore requires PyTorch and torchvision; please follow PyTorch's getting-started instructions for installation. If you are using conda on a Linux machine, you can use the following commands -

conda create --name omnivore python=3.8
conda activate omnivore
conda install pytorch=1.9.0 torchvision=0.10.0 torchaudio=0.9.0 cudatoolkit=11.1 -c pytorch

We also require einops, pytorchvideo, and timm, which can be installed via pip -

pip install einops
pip install pytorchvideo
pip install timm

Model Zoo

We share checkpoints for all the Omnivore models in the paper. The models are available via torch.hub, and we also share URLs to all the checkpoints.

The details of the models, their torch.hub names / checkpoint links, and their performance are listed below.

| Name | IN1k Top-1 | Kinetics-400 Top-1 | SUN RGB-D Top-1 | Model |
|------|------------|--------------------|-----------------|-------|
| Omnivore Swin T | 81.2 | 78.9 | 62.3 | omnivore_swinT |
| Omnivore Swin S | 83.4 | 82.2 | 64.6 | omnivore_swinS |
| Omnivore Swin B | 84.0 | 83.3 | 65.4 | omnivore_swinB |
| Omnivore Swin B (IN21k) | 85.3 | 84.0 | 67.2 | omnivore_swinB_imagenet21k |
| Omnivore Swin L (IN21k) | 86.0 | 84.1 | 67.1 | omnivore_swinL_imagenet21k |

Numbers are based on Table 2 and Table 4 in the Omnivore paper.

The models can be loaded via torch.hub using the following command -

import torch

model = torch.hub.load("facebookresearch/omnivore", model="omnivore_swinB")
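
As a quick sanity check, here is a minimal sketch of a forward pass. It assumes the loaded model selects the modality via an input_type keyword ("image", "video", or "rgbd"), as in the inference tutorial, and feeds a random tensor in place of a preprocessed image.

# Minimal sketch of a forward pass; the input_type keyword and the
# (batch, 3, 224, 224) image layout are assumptions based on the inference tutorial.
model = model.eval()

dummy_image = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    logits = model(dummy_image, input_type="image")

print(logits.shape)  # for image inputs this should be ImageNet-1K class logits, e.g. (1, 1000)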

The class mappings for the datasets can be downloaded as follows:

wget https://s3.amazonaws.com/deep-learning-models/image-models/imagenet_class_index.json 
wget https://dl.fbaipublicfiles.com/pyslowfast/dataset/class_names/kinetics_classnames.json 
wget https://dl.fbaipublicfiles.com/omnivore/sunrgbd_classnames.json
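
As an example of putting these mappings to use, here is a minimal sketch that turns the image logits from the snippet above into class names. It assumes imagenet_class_index.json has its usual structure: a dict mapping "<class index>" to ["<WordNet id>", "<class name>"].

import json

# Assumption: imagenet_class_index.json maps "<class index>" -> ["<WordNet id>", "<class name>"].
with open("imagenet_class_index.json") as f:
    imagenet_id_to_name = {int(idx): entry[1] for idx, entry in json.load(f).items()}

# Map the top-5 image logits from the snippet above to human-readable labels.
top5_indices = logits.topk(5, dim=1).indices[0].tolist()
print([imagenet_id_to_name[i] for i in top5_indices])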

Run Inference

Follow the inference_tutorial.ipynb tutorial locally, or open it in Colab, for step-by-step instructions on how to run inference on an image, a video, and an RGBD image.
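
For orientation before opening the tutorial, the sketch below summarizes the input layouts we assume for the other modalities (the tutorial's preprocessing is the authoritative reference): videos as (batch, 3, T, H, W) clips and RGBD inputs as (batch, 4, H, W) with a depth/disparity channel appended to RGB.

# Assumed input layouts (see the inference tutorial for the exact preprocessing):
#   image: (batch, 3, H, W)
#   video: (batch, 3, T, H, W) -- T sampled frames per clip
#   rgbd : (batch, 4, H, W)    -- RGB plus a depth/disparity channel
with torch.no_grad():
    video_logits = model(torch.randn(1, 3, 32, 224, 224), input_type="video")
    rgbd_logits = model(torch.randn(1, 4, 224, 224), input_type="rgbd")
print(video_logits.shape, rgbd_logits.shape)  # expected: Kinetics-400 and SUN RGB-D class logits, respectively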

To run the tutorial locally you need Jupyter Notebook. To install it with conda, you can run the following:

conda install jupyter nb_conda ipykernel ipywidgets

Omnivore is integrated into Hugging Face Spaces 🤗 using Gradio. Try out the web demo on Hugging Face Spaces.

A Replicate web demo and Docker image are also available. You can try it with different checkpoints on Replicate.

Citation

If this work is helpful in your research, please consider starring ⭐ us and citing:

@article{girdhar2022omnivore,
  title={{Omnivore: A Single Model for Many Visual Modalities}},
  author={Girdhar, Rohit and Singh, Mannat and Ravi, Nikhila and van der Maaten, Laurens and Joulin, Armand and Misra, Ishan},
  journal={arXiv preprint arXiv:2201.08377},
  year={2022}
}

Contributing

We welcome your pull requests! Please see CONTRIBUTING and CODE_OF_CONDUCT for more information.

License

Omnivore is released under the CC-BY-NC 4.0 license. See LICENSE for additional details. However, the Swin Transformer implementation is additionally licensed under the Apache 2.0 license (see NOTICE for additional details).
