Awesome Embodied Multimodal LLMs
(Vision-Language-Action Models)

This is a collection of research papers about Embodied Multimodal Large Language Models (VLA models).

If you would like to add your paper or update any details (e.g., code URLs, conference information), please submit a pull request or email me. Any advice is also welcome.

Overview

Embodied Multimodal LLMs integrate visual information and action outputs into large language models (LLMs). By drawing on the rich knowledge and strong reasoning capabilities of LLMs, these models can follow human instructions interactively, build a comprehensive understanding of the real world, and carry out a variety of embodied tasks. They hold great potential as a path toward Artificial General Intelligence (AGI).

Models

| Title | Date | Code |
|---|---|---|
| OpenVLA: An Open-Source Vision-Language-Action Model | 2024-06-13 | GitHub |
| A3VLM: Actionable Articulation-Aware Vision Language Model | 2024-06-11 | GitHub |
| Embodied CoT Distillation From LLM To Off-the-shelf Agents | 2024-05-02 | - |
| RoboMP$^2$: A Robotic Multimodal Perception-Planning Framework with Multimodal Large Language Models | 2024-04-07 | - |
| 3D-VLA: A 3D Vision-Language-Action Generative World Model | 2024-03-14 | GitHub |
| ShapeLLM: Universal 3D Object Understanding for Embodied Interaction | 2024-02-27 | GitHub |
| NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation | 2024-02-24 | - |
| MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World | 2024-01-16 | GitHub |
| ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation | 2023-12-24 | GitHub |
| MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception | 2023-12-12 | GitHub |
| Towards Learning a Generalist Model for Embodied Navigation | 2023-12-04 | GitHub |
| LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning | 2023-11-30 | GitHub |
| An Embodied Generalist Agent in 3D World | 2023-11-18 | GitHub |
| Large Language Models as Generalizable Policies for Embodied Tasks | 2023-10-26 | GitHub |
| RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control | 2023-07-28 | - |
| Building Cooperative Embodied Agents Modularly with Large Language Models | 2023-07-05 | GitHub |
| EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | 2023-05-24 | GitHub |
| PaLM-E: An Embodied Multimodal Language Model | 2023-03-06 | - |

Datasets & Benchmark

| Title | Date | Code |
|---|---|---|
| OpenEQA: Embodied Question Answering in the Era of Foundation Models | 2024-06-17 | GitHub |
| PhyScene: Physically Interactable 3D Scene Synthesis for Embodied AI | 2024-04-15 | GitHub |
| EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI | 2023-12-26 | GitHub |
| Holodeck: Language Guided Generation of 3D Embodied AI Environments | 2023-12-14 | GitHub |
| Learning Interactive Real-World Simulators | 2023-10-09 | - |
| Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond | 2023-10-03 | GitHub |

Contributors

tulerfeng
