PyTorch code for “TVLT: Textless Vision-Language Transformer” (NeurIPS 2022)

License: MIT License


TVLT

Zineng Tang*, Jaemin Cho*, Yixin Nie*, Mohit Bansal

Learning compact visual-linguistic Transformer representations from low-level, continuous visual 👁 and audio 👂 perception signals, without assuming the prior existence of written text or tokens.

Introduction

Transformers for Vision-Language (VL) representation learning rely heavily on text-based inputs. (Some works use the audio channel, but only as an auxiliary channel.)

TVLT takes audio and visual inputs for VL representation learning with minimal modality-specific design and without text-specific modules such as tokenization and automatic speech recognition (ASR).

TVLT is pre-trained with vision-audio matching and masked autoencoding (mask and then reconstruct the continuous input of video frames and audio spectrograms), following the idea of training scalable vision learners with masked autoencoding on images (MAE).

TVLT attains performance comparable to its text-based counterpart on various multimodal tasks, such as visual question answering and multimodal sentiment analysis, with 28x faster inference speed and only 1/3 of the parameters.
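
To make the masked-autoencoding objective concrete, below is a minimal sketch of an MAE-style reconstruction loss computed only on masked patches. The tensor names and shapes are illustrative assumptions, not this repository's implementation; TVLT applies the same idea to both video-frame patches and audio-spectrogram patches.

# Minimal sketch (not the repo's code): MAE-style loss over masked patches.
import torch

def masked_reconstruction_loss(target_patches, reconstructed_patches, mask):
    # target_patches, reconstructed_patches: (batch, num_patches, patch_dim)
    # mask: (batch, num_patches), 1 where a patch was masked out, 0 where it stayed visible
    per_patch_mse = ((reconstructed_patches - target_patches) ** 2).mean(dim=-1)
    mask = mask.float()
    return (per_patch_mse * mask).sum() / mask.sum()  # average over masked patches only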

Install

Set up the Python environment

conda create -n TVLT python=3.8   # You can also use another environment.

Install PyTorch, torchvision, and torchaudio

The following versions have been tested:

  • torch 1.10.0, 1.12.1
  • torchvision 0.11.1, 0.12.1
  • torchaudio 0.10.0, 0.13.1

You can try other versions of PyTorch, but make sure they are compatible with your CUDA and cuDNN.
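
A quick sanity check after installing is to print the installed versions and confirm that PyTorch can see your GPU (a minimal sketch; the versions printed depend on your environment):

# Verify the torch/torchvision/torchaudio install and CUDA visibility.
import torch
import torchvision
import torchaudio

print("torch:", torch.__version__, "torchvision:", torchvision.__version__, "torchaudio:", torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available(), "CUDA version:", torch.version.cuda)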

Install other dependencies

pip install -r requirements.txt

Demos

Get familiar with TVLT by trying the demo notebooks.

Training

Pretraining (Data + scripts) -> TVLT Pretraining

# Example
bash scripts/pretrain_mae_vam.sh

Finetuning on Downstream (Data + scripts) -> TVLT Finetuning

# Example
bash scripts/finetune_mosei.sh

Released Models

The model weights are hosted on the Hugging Face Hub.
If you have tried the demos, some models should already have been downloaded.

The details of each released TVLT model are described in the table below.

Training | Input Format | Component | Link
Pre-trained on HowTo100M + YTTemporal videos | Video 👁 + Audio 👂 | Encoder + Decoder | [link]
Pre-trained on HowTo100M + YTTemporal videos, then finetuned on CMU-MOSEI sentiment analysis | Video 👁 + Audio 👂 | Encoder + Classification Head | [link]
Pre-trained on HowTo100M + YTTemporal videos, then finetuned on CMU-MOSEI emotion analysis | Video 👁 + Audio 👂 | Encoder + Classification Head | [link]
Pre-trained on HowTo100M + YTTemporal videos + ASR, then finetuned on CMU-MOSEI emotion analysis | Video 👁 + Text ✍️ | Encoder + Classification Head | [link]

To be continued... (Stay tuned; more pre-trained variants coming soon.)
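
Because the link targets are omitted above, the snippet below only sketches how a checkpoint hosted on the Hugging Face Hub is typically fetched with the huggingface_hub package; the repo_id and filename are placeholders, not the actual identifiers of the released TVLT weights.

# Hedged sketch: fetch a checkpoint from the Hugging Face Hub and load its weights.
from huggingface_hub import hf_hub_download
import torch

ckpt_path = hf_hub_download(repo_id="<tvlt-repo-id>", filename="<checkpoint-file>")  # placeholders
state_dict = torch.load(ckpt_path, map_location="cpu")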

Folder Structure

See Folder Structure

Updates

  • Initial Code Release
  • Notebook Demos
  • Colab
  • Release TTS question audios for VQA (We convert all the textual questions of VQAv2 to audio using Google TTS API.)

...
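
Regarding the TTS question audios mentioned above, the sketch below shows how a textual VQA question could be converted to speech with the gTTS package. It is only an illustrative stand-in; the released audios were generated with the Google TTS API, and the exact pipeline may differ.

# Hedged sketch: convert a textual VQAv2-style question to an audio clip.
from gtts import gTTS

question = "What color is the umbrella?"  # example question text
gTTS(text=question, lang="en").save("question.mp3")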

Recommended Usage

In our experiments, we pre-train TVLT on HowTo100M and YTTemporal videos. However, we recommend unlocking the power of TVLT by pre-training it on larger-scale video data for more generic vision-language representations.
The resulting models can either be used to directly process video inputs (with the audio channel) for tasks such as audio-image/video retrieval, audio-VQA, and TTS-based VQA, or to extract visual-acoustic features for other tasks such as speech translation, multimodal content understanding, etc.
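
As one concrete step in such a pipeline, the audio channel is typically converted to a (log-)mel spectrogram before being fed to the model. The sketch below uses torchaudio; the file path and parameter values are illustrative and may differ from the ones used in this codebase.

# Hedged sketch: turn a waveform into a log-mel spectrogram for the audio channel.
import torch
import torchaudio

waveform, sample_rate = torchaudio.load("example.wav")  # placeholder path
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_mels=128)(waveform)
log_mel = torch.log(mel + 1e-6)  # log scale for numerical stability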

Citation

@inproceedings{tang2022tvlt,
  title     = {TVLT: Textless Vision-Language Transformer},
  author    = {Zineng Tang and Jaemin Cho and Yixin Nie and Mohit Bansal},
  booktitle = {NeurIPS},
  year      = {2022}
}

Acknowledgement

The idea of this paper is heavily inspired by Masked Autoencoders Are Scalable Vision Learners.
Our codebase is based on ViLT. We thank the authors for their open-source contributions.

Contact

Zineng Tang ([email protected])
