VIMA: General Robot Manipulation with Multimodal Prompts

[Website] [arXiv] [PDF] [Pretrained Models] [VIMA-Bench] [Training Data] [Model Card]

Prompt-based learning has emerged as a successful paradigm in natural language processing, where a single general-purpose language model can be instructed to perform any task specified by input prompts. However, different robotics tasks are still tackled by specialized models. This work shows that a wide spectrum of robot manipulation tasks can be expressed with multimodal prompts that interleave textual and visual tokens. We introduce VIMA (VisuoMotor Attention model, pronounced "v-eye-ma"), a novel scalable multi-task robot learner with a uniform sequence IO interface achieved through multimodal prompts.

The architecture follows the encoder-decoder transformer design that has proven effective and scalable in NLP. VIMA encodes an input sequence of interleaved textual and visual prompt tokens with a pretrained language model, and decodes robot control actions autoregressively for each environment interaction step. The transformer decoder is conditioned on the prompt via cross-attention layers that alternate with the usual causal self-attention. Instead of operating on raw pixels, VIMA adopts an object-centric approach: all images in the prompt or observation are parsed into objects by off-the-shelf detectors and flattened into sequences of object tokens. Combined, these design choices deliver a conceptually simple architecture with strong model and data scaling properties.
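To make the idea of an interleaved multimodal prompt concrete, here is a minimal sketch in plain Python. It is not the VIMA implementation: the names `TextToken`, `ObjectToken`, and `interleave_prompt`, the `{placeholder}` template syntax, and the example prompt are all assumptions made for illustration. The sketch only shows how words and per-image object tokens can end up in one ordered sequence.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class TextToken:
    """Hypothetical stand-in for a token that comes from the text of the prompt."""
    word: str


@dataclass(frozen=True)
class ObjectToken:
    """Hypothetical stand-in for one detected object in a prompt image."""
    image_name: str  # which prompt image the object came from
    object_id: int   # index of the detected object within that image


PromptToken = Union[TextToken, ObjectToken]


def interleave_prompt(template: str, num_objects: Dict[str, int]) -> List[PromptToken]:
    """Flatten a text template with {image} slots into one token sequence.

    `num_objects` maps each image placeholder to the number of objects a
    detector is assumed to have found in that image.
    """
    tokens: List[PromptToken] = []
    # Each match is either a {placeholder} (group 1) or a plain word (group 2).
    for placeholder, word in re.findall(r"\{(\w+)\}|(\S+)", template):
        if placeholder:
            if placeholder not in num_objects:
                raise KeyError(f"no detections given for image {placeholder!r}")
            # One object token per detected object, in detection order.
            for object_id in range(num_objects[placeholder]):
                tokens.append(ObjectToken(placeholder, object_id))
        else:
            tokens.append(TextToken(word))
    return tokens


if __name__ == "__main__":
    sequence = interleave_prompt(
        "Put the {dragged_obj} into the {base_obj}",
        {"dragged_obj": 1, "base_obj": 2},
    )
    for token in sequence:
        print(token)
```

In the actual model, the text portion is handled by a pretrained language model and each object token carries learned features rather than an index; the sketch keeps only the ordering.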

This repo provides the VIMA model code, pretrained checkpoints covering a range of model sizes, and demo and evaluation scripts. The codebase is released under the MIT License.

Installation

VIMA requires Python ≥ 3.9. We have tested it on Ubuntu 20.04. Install the VIMA codebase with:

pip install git+https://github.com/vimalabs/VIMA

Pretrained Models

We host pretrained models covering a range of model capacities on Hugging Face. The available model sizes are listed below.

200M	92M	43M	20M	9M	4M	2M

Demo

To run the live demonstration, first follow the instructions to install VIMA-Bench. Then run the live demo with:

python3 scripts/example.py --ckpt={ckpt_path} --device={device} --partition={eval_level} --task={task}

Here eval_level is one of the four evaluation levels: placement_generalization, combinatorial_generalization, novel_object_generalization, or novel_task_generalization. task names a specific meta-task. Please refer to the task suite and benchmark for more details.
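As a convenience, the command above can be assembled and checked programmatically. The following is a minimal sketch, not part of the repo: the helper name `build_demo_command` is hypothetical, and the checkpoint path, device string, and task name in the usage example are placeholders. Only the script path, the four flags, and the four evaluation level names come from this README.

```python
import shlex
from typing import List

# The four evaluation levels named in this README.
EVAL_LEVELS = (
    "placement_generalization",
    "combinatorial_generalization",
    "novel_object_generalization",
    "novel_task_generalization",
)


def build_demo_command(ckpt_path: str, device: str, eval_level: str, task: str) -> List[str]:
    """Return the argv list for scripts/example.py (hypothetical helper)."""
    if eval_level not in EVAL_LEVELS:
        raise ValueError(
            f"unknown eval_level {eval_level!r}; expected one of {', '.join(EVAL_LEVELS)}"
        )
    return [
        "python3",
        "scripts/example.py",
        f"--ckpt={ckpt_path}",
        f"--device={device}",
        f"--partition={eval_level}",
        f"--task={task}",
    ]


if __name__ == "__main__":
    # Placeholder values: substitute a real checkpoint path and meta-task name.
    cmd = build_demo_command("path/to/model.ckpt", "cpu", "placement_generalization", "some_task")
    print(shlex.join(cmd))
    # To launch the demo, pass `cmd` to subprocess.run(cmd, check=True).
```

Checking the evaluation level before launching catches a mistyped partition name early instead of inside the demo script.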

After running the above command, a PyBullet GUI should pop up, alongside a small window showing the multimodal prompt. The robot arm should then move to complete the corresponding task. Note that this demo may not work on headless machines, since the PyBullet GUI requires a display.
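Before launching the demo on a remote machine, it can help to check whether a display is likely to be available. The sketch below is a rough heuristic and not part of the repo: the function name `gui_likely_available` is hypothetical, and it assumes that on Linux a GUI session is indicated by a non-empty `DISPLAY` environment variable, while other platforms are assumed to have a display.

```python
import os
import sys
from typing import Mapping, Optional


def gui_likely_available(
    env: Optional[Mapping[str, str]] = None,
    platform: Optional[str] = None,
) -> bool:
    """Heuristic check for a usable display (hypothetical helper)."""
    env = os.environ if env is None else env
    platform = sys.platform if platform is None else platform
    if platform.startswith("linux"):
        # Assumption: an X11 session exposes a non-empty DISPLAY variable.
        return bool(env.get("DISPLAY"))
    # Assumption: desktop macOS and Windows sessions have a display.
    return True


if __name__ == "__main__":
    if not gui_likely_available():
        print("No display detected; the PyBullet GUI demo may not start.")
```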

Paper and Citation

Our paper is posted on arXiv. If you find our work useful, please consider citing us!

@article{jiang2022vima,
  title   = {VIMA: General Robot Manipulation with Multimodal Prompts},
  author  = {Yunfan Jiang and Agrim Gupta and Zichen Zhang and Guanzhi Wang and Yongqiang Dou and Yanjun Chen and Li Fei-Fei and Anima Anandkumar and Yuke Zhu and Linxi Fan},
  year    = {2022},
  journal = {arXiv preprint arXiv: Arxiv-2210.03094}
}
