
Awesome Large Multimodal Agents

Last update: 02/01/2024

Table of Contents


Papers

Taxonomy

Type Ⅰ

  • CLOVA - CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update

  • CRAFT - CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets

  • ViperGPT - ViperGPT: Visual Inference via Python Execution for Reasoning

  • HuggingGPT - HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

  • Chameleon - Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models

  • Visual ChatGPT - Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

  • AssistGPT - AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn

  • M3 - Towards Robust Multi-Modal Reasoning via Model Selection

  • VisProg - Visual Programming: Compositional visual reasoning without training

  • DDCoT - DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models

  • ASSISTGUI - ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation

  • GPT-Driver - GPT-Driver: Learning to Drive with GPT

  • LLaVA-Interactive - LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing

  • MusicAgent - MusicAgent: An AI Agent for Music Understanding and Generation with Large Language Models

  • AudioGPT - AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

  • DroidBot-GPT - DroidBot-GPT: GPT-powered UI Automation for Android

  • GRID - GRID: A Platform for General Robot Intelligence Development

  • DEPS - Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents

  • MM-REACT - MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

Type Ⅱ

  • STEVE - See and Think: Embodied Agent in Virtual Environment

  • EMMA - Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld

  • MLLM-Tool - MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning

  • LLaVA-Plus - LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills

  • GPT4Tools - GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction

  • WebWISE - WebWISE: Web Interface Control and Sequential Exploration with Large Language Models

  • Auto-UI - You Only Look at Screens: Multimodal Chain-of-Action Agents

Type Ⅲ

  • DoraemonGPT - DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models

  • ChatVideo - ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System

Type Ⅳ

  • JARVIS-1 - JARVIS-1: Open-world Multi-task Agents with Memory-Augmented Multimodal Language Models

  • AppAgent - AppAgent: Multimodal Agents as Smartphone Users

  • MM-Navigator - GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation

  • Loop Copilot - Loop Copilot: Conducting AI Ensembles for Music Generation and Iterative Editing

  • WavJourney - WavJourney: Compositional Audio Creation with Large Language Models

  • DLAH - Drive Like a Human: Rethinking Autonomous Driving with Large Language Models

Multi-Agent

  • MP5 - MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception

  • MemoDroid - Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation

  • AVIS - AVIS: Autonomous Visual Information Seeking with Large Language Model Agent

Application

💡 Complex Visual Reasoning Tasks

  • ViperGPT - ViperGPT: Visual Inference via Python Execution for Reasoning

  • HuggingGPT - HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

  • Chameleon - Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models

  • Visual ChatGPT - Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

  • AssistGPT - AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn

  • LLaVA-Plus - LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills

  • GPT4Tools - GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction

  • MLLM-Tool - MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning

  • M3 - Towards Robust Multi-Modal Reasoning via Model Selection

  • VisProg - Visual Programming: Compositional visual reasoning without training

  • DDCoT - DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models

  • AVIS - AVIS: Autonomous Visual Information Seeking with Large Language Model Agent

  • CLOVA - CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update

  • CRAFT - CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets

🎵 Audio Editing & Generation

  • Loop Copilot - Loop Copilot: Conducting AI Ensembles for Music Generation and Iterative Editing

  • MusicAgent - MusicAgent: An AI Agent for Music Understanding and Generation with Large Language Models

  • AudioGPT - AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

  • WavJourney - WavJourney: Compositional Audio Creation with Large Language Models

🤖 Embodied AI & Robotics

  • JARVIS-1 - JARVIS-1: Open-world Multi-task Agents with Memory-Augmented Multimodal Language Models

  • DEPS - Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents

  • Octopus - Octopus: Embodied Vision-Language Programmer from Environmental Feedback

  • GRID - GRID: A Platform for General Robot Intelligence Development

  • MP5 - MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception

  • STEVE - See and Think: Embodied Agent in Virtual Environment

  • EMMA - Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld

🖱️💻 UI Assistants

  • AppAgent - AppAgent: Multimodal Agents as Smartphone Users

  • DroidBot-GPT - DroidBot-GPT: GPT-powered UI Automation for Android

  • WebWISE - WebWISE: Web Interface Control and Sequential Exploration with Large Language Models

  • Auto-UI - You Only Look at Screens: Multimodal Chain-of-Action Agents

  • MemoDroid - Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation

  • ASSISTGUI - ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation

  • MM-Navigator - GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation

  • AutoDroid - Empowering LLM to use Smartphone for Intelligent Task Automation

  • GPT-4V-Act - GPT-4V-Act: Chromium Copilot

🎨 Visual Generation & Editing

  • LLaVA-Interactive - LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing

  • MM-REACT - MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

🎥 Video Understanding

  • DoraemonGPT - DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models

  • ChatVideo - ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System

  • AssistGPT - AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn

🚗 Autonomous Driving

  • GPT-Driver - GPT-Driver: Learning to Drive with GPT

  • DLAH - Drive Like a Human: Rethinking Autonomous Driving with Large Language Models

🎮 Game Development

Benchmark

  • SmartPlay - SmartPlay: A Benchmark for LLMs as Intelligent Agents

  • VisualWebArena - VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

  • GAIA - GAIA: a benchmark for General AI Assistants

Contributors

jun0wanan, zhjohnchan
