mioupo / bubogpt

This project forked from magic-research/bubogpt

BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs

Home Page: https://bubo-gpt.github.io/

License: BSD 3-Clause "New" or "Revised" License

A multi-modal LLM capable of jointly understanding text, vision, and audio, and of grounding knowledge in visual objects.

[Project Page] [Arxiv] [Demo Video] [Gradio] [Data] [Model]

[Figure: BuboGPT framework]

BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs
Yang Zhao*, Zhijie Lin*, Daquan Zhou, Zilong Huang, Jiashi Feng and Bingyi Kang† (*Equal Contribution, †Project Lead)
Bytedance Inc.

HuggingFace space

News🔥

2023/07/21 - Hugging Face demo released!

Setup

Clone this repository and navigate into the cloned folder.

Environment

Our code is developed against Python 3.9, CUDA 11.7, and PyTorch 2.0.1.

pip3 install -r pre-requirements.txt
pip3 install -r requirements.txt
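
After installation, a quick check can confirm that the interpreter and PyTorch build match the versions listed above. The following is a minimal sketch, not part of the repository: the helper names `version_prefix_matches` and `check_environment` are hypothetical, and the check assumes that local build suffixes such as `+cu117` can be ignored.

```python
import sys

# Versions stated in this README.
EXPECTED_PYTHON = (3, 9)
EXPECTED_TORCH = "2.0.1"
EXPECTED_CUDA = "11.7"


def version_prefix_matches(found, expected):
    """Return True if `found` starts with the dotted components of `expected`.

    A local build suffix after "+" (for example "+cu117") is ignored.
    `found` may be None, e.g. torch.version.cuda on a CPU-only build.
    """
    if found is None:
        return False
    found_parts = str(found).split("+")[0].split(".")
    expected_parts = expected.split(".")
    return found_parts[: len(expected_parts)] == expected_parts


def check_environment():
    """Return a list of mismatch messages; an empty list means all checks passed."""
    problems = []
    if sys.version_info[:2] != EXPECTED_PYTHON:
        found = f"{sys.version_info[0]}.{sys.version_info[1]}"
        problems.append(f"Python {found} found, 3.9 expected")
    try:
        import torch  # imported lazily so the helper above works without PyTorch
    except ImportError:
        problems.append("PyTorch is not installed")
        return problems
    if not version_prefix_matches(torch.__version__, EXPECTED_TORCH):
        problems.append(f"PyTorch {torch.__version__} found, {EXPECTED_TORCH} expected")
    if not version_prefix_matches(torch.version.cuda, EXPECTED_CUDA):
        problems.append(f"CUDA build {torch.version.cuda} found, {EXPECTED_CUDA} expected")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

A mismatch is reported as a warning rather than an error, since nearby versions may still work; that has not been verified here.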

Models

Follow the instructions to prepare the pretrained Vicuna weights, then set llama_model in bubogpt/configs/models/mmgpt4.yaml to the path of the prepared weights.
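
The config edit can also be scripted. The sketch below is an illustration under one assumption: that the YAML file contains a single-line entry of the form `llama_model: <path>`. The function name `set_llama_model` is hypothetical and not part of the repository. It rewrites the line with a regular expression instead of a YAML library, so the rest of the file is left as-is, but any trailing comment on the `llama_model` line itself is dropped.

```python
import re
import sys
from pathlib import Path

# Matches a whole "llama_model: ..." line and captures everything up to the value.
LLAMA_MODEL_LINE = re.compile(r"^([ \t]*llama_model[ \t]*:[ \t]*).*$", re.MULTILINE)


def set_llama_model(config_text, weights_path):
    """Return config_text with every llama_model value replaced by weights_path.

    Raises ValueError when no llama_model line is present, so a wrong file
    is not silently written back unchanged.
    """
    new_text, count = LLAMA_MODEL_LINE.subn(
        lambda m: f'{m.group(1)}"{weights_path}"', config_text
    )
    if count == 0:
        raise ValueError("no llama_model entry found in config")
    return new_text


if __name__ == "__main__":
    # Usage: python set_llama_model.py /path/to/vicuna/weights
    config_path = Path("bubogpt/configs/models/mmgpt4.yaml")
    config_path.write_text(set_llama_model(config_path.read_text(), sys.argv[1]))
```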

# Download the pre-trained checkpoints (-p keeps mkdir from failing if the folder already exists)
mkdir -p checkpoints && cd checkpoints
wget https://huggingface.co/spaces/Vision-CAIR/minigpt4/resolve/main/blip2_pretrained_flant5xxl.pth
wget https://huggingface.co/spaces/xinyu1205/recognize-anything/resolve/main/ram_swin_large_14m.pth
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
wget https://huggingface.co/spaces/abhishek/StableSAM/resolve/main/sam_vit_h_4b8939.pth
wget https://huggingface.co/magicr/BuboGPT-ckpt/resolve/main/bubogpt_7b.pth

For training, download the MiniGPT-4 checkpoint into the checkpoints folder.
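
Because the downloads are large, it can help to confirm they all completed before launching anything. The following is a minimal sketch: the file names are taken from the download URLs above, while the helper `missing_checkpoints` is hypothetical. The MiniGPT-4 checkpoint is left out because its file name is not given here.

```python
from pathlib import Path

# File names as they appear at the end of the download URLs in this README.
EXPECTED_CHECKPOINTS = [
    "blip2_pretrained_flant5xxl.pth",
    "ram_swin_large_14m.pth",
    "groundingdino_swint_ogc.pth",
    "sam_vit_h_4b8939.pth",
    "bubogpt_7b.pth",
]


def missing_checkpoints(directory="checkpoints"):
    """Return the expected checkpoint names that are absent or zero bytes long."""
    root = Path(directory)
    missing = []
    for name in EXPECTED_CHECKPOINTS:
        path = root / name
        # An empty file usually indicates an interrupted download.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_checkpoints()
    if absent:
        print("Missing or empty checkpoints:", ", ".join(absent))
    else:
        print("All expected checkpoints are present.")
```

This only checks presence and non-zero size; it does not verify file integrity.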

Data

Stage 1

Stage 2

Usage

Gradio demo

Run the Gradio demo with:

python3 app.py --cfg-path eval_configs/mmgpt4_eval.yaml --gpu-id 0

Training

In the dataset config folder, open each dataset config and replace its storage entry with the path to your copy of the data (path/to/your/data).
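
To find configs whose storage entry still points at a location that does not exist, a small scan can help. The sketch below assumes that the dataset configs are `.yaml` files and that each storage entry is a single line of the form `storage: <path>`; neither assumption has been checked against the repository. The helper names are hypothetical, and the config folder is passed as an argument because its location is not stated here.

```python
import re
import sys
from pathlib import Path

# Captures the value of a single-line "storage: <value>" entry.
STORAGE_LINE = re.compile(r"^[ \t]*storage[ \t]*:[ \t]*(.+?)[ \t]*$", re.MULTILINE)


def storage_values(config_text):
    """Return the value of every single-line storage entry, without surrounding quotes."""
    return [m.group(1).strip("'\"") for m in STORAGE_LINE.finditer(config_text)]


def missing_storage_paths(config_dir):
    """Map each YAML file under config_dir to its storage paths that do not exist on disk."""
    report = {}
    for cfg in sorted(Path(config_dir).rglob("*.yaml")):
        missing = [p for p in storage_values(cfg.read_text()) if not Path(p).exists()]
        if missing:
            report[str(cfg)] = missing
    return report


if __name__ == "__main__":
    # Usage: python check_storage.py <dataset config folder>
    for cfg_file, paths in missing_storage_paths(sys.argv[1]).items():
        print(f"{cfg_file}: {', '.join(paths)}")
```

Storage entries written as multi-line YAML lists are not detected by this pattern.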

Stage 1: Audio pre-training

bash dist_train.sh train_configs/mmgpt4_stage1_audio.yaml

Stage 2: Multi-modal instruct tuning

bash dist_train.sh train_configs/mmgpt4_stage2_mm.yaml
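
The two stages can be chained so that stage 2 starts only if stage 1 exits cleanly. The following is a minimal sketch using the commands shown above: the helpers `stage_commands` and `run_stages` are hypothetical, and the script is assumed to run from the repository root. Whether the stage 2 config must first be pointed at the stage 1 output is not stated here, so check that config before chaining the runs.

```python
import subprocess

# Config paths as given in this README, in execution order.
STAGE_CONFIGS = [
    "train_configs/mmgpt4_stage1_audio.yaml",  # Stage 1: audio pre-training
    "train_configs/mmgpt4_stage2_mm.yaml",     # Stage 2: multi-modal instruct tuning
]


def stage_commands(configs=STAGE_CONFIGS):
    """Build one `bash dist_train.sh <config>` argument list per stage."""
    return [["bash", "dist_train.sh", cfg] for cfg in configs]


def run_stages(commands, runner=subprocess.run):
    """Run the commands in order and stop at the first non-zero exit code.

    `runner` is injectable so the control flow can be exercised without
    launching real training jobs. Returns the number of stages that succeeded.
    """
    completed = 0
    for cmd in commands:
        result = runner(cmd)
        if result.returncode != 0:
            break
        completed += 1
    return completed


if __name__ == "__main__":
    done = run_stages(stage_commands())
    print(f"{done} of {len(STAGE_CONFIGS)} stages completed")
```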

Demo

1. Image Understanding with Grounding

2. Audio Understanding

3. Aligned Audio-Image Understanding

4. Arbitrary Audio-Image Understanding

For more demonstrations, please refer to the examples.

Acknowledgement

This codebase is mainly developed based on the following repos:
