LaVIN

Setup

Install Package

  • PyTorch 1.12
source ~/anaconda3/etc/profile.d/conda.sh
conda create -n lavin python=3.9 -y
conda activate lavin

# install pytorch
# conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 -c pytorch
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113

# install dependency and lavin
pip install -r requirements.txt
pip install -e .
  • PyTorch 2.1
source ~/anaconda3/etc/profile.d/conda.sh
conda create -n lavin-torch2.1 python=3.9 -y
conda activate lavin-torch2.1

# install pytorch 2.1
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# install dependency and lavin
pip install -r requirements-torch2.1.txt
pip install -e .
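
After installation, a quick stdlib-only check can confirm the core packages are importable before you move on to data preparation (a minimal sketch; `is_installed` is a hypothetical helper, not part of LaVIN):

```python
# Hypothetical post-install sanity check (not part of the LaVIN repo):
# verifies that the packages installed above can be found by the import system.
import importlib.util

def is_installed(module_name: str) -> bool:
    """Return True if `module_name` is resolvable by the import machinery."""
    return importlib.util.find_spec(module_name) is not None

if __name__ == "__main__":
    for mod in ("torch", "torchvision", "torchaudio"):
        status = "ok" if is_installed(mod) else "MISSING"
        print(f"{mod}: {status}")
```

Note that a successful import does not guarantee the CUDA build matches your driver; if fine-tuning later fails, also check `torch.cuda.is_available()` inside the activated environment.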

Data Preparation

  • For ScienceQA, please prepare the dataset from the official repo.
  • For Multimodal Chatbot, download the images in train2014 split from MSCOCO, and obtain the prepared 52k text-only and 158k text-image instruction-following data from here.
  • Obtain the weights of LLaMA from this form (official) or download LLaMA-7B and LLaMA-13B from HuggingFace (unofficial).
  • If you want to use Vicuna weights to initialize the model, please download from here. After that, the file structure should look like:
LaVIN/
  |-- lavin
  |-- scripts
  |-- train.py
  |-- eval.py
  ......
data/
  |-- problem.json
  |-- pid_splits.json
  |-- captions.json
  |-- all_data.json
  |-- images
      |-- train2014      # MSCOCO 2014
      |-- val2014        # MSCOCO 2014
      |-- train          # ScienceQA train image
      |-- val            # ScienceQA val image
      |-- test           # ScienceQA test image
  |-- weights
      |-- tokenizer.model
      |-- 7B
          |-- params.json
          |-- consolidated.00.pth
      |-- 13B
          |-- params.json
          |-- consolidated.00.pth
          |-- consolidated.01.pth
      |-- vicuna_7B
      |-- vicuna_13B
          |-- config.json
          |-- generation_config.json
          |-- pytorch_model.bin.index.json
          |-- special_tokens_map.json
          |-- tokenizer_config.json
          |-- tokenizer.model
          |-- pytorch_model-00001-of-00003.bin
          |-- pytorch_model-00002-of-00003.bin
          |-- pytorch_model-00003-of-00003.bin
      ......

Fine-tuning

ScienceQA

Reproduce the performance of LaVIN-7B on ScienceQA. We fine-tune the 7B model on 2x A100 (we find that performance is affected by the number of GPUs and are working to address this).

LLaMA weights:

bash ./scripts/finetuning_sqa_7b.sh

Vicuna weights:

bash ./scripts/finetuning_sqa_vicuna_7b.sh

LaVIN-lite with LLaMA weights (single GPU):

bash ./scripts/finetuning_sqa_vicuna_7b_lite.sh

Reproduce the performance of LaVIN-13B on ScienceQA. We fine-tune the 13B model on 8x A100 (80G), which takes ~2 hours.

LLaMA weights:

bash ./scripts/finetuning_sqa_13b.sh

Vicuna weights:

bash ./scripts/finetuning_sqa_vicuna_13b.sh

LaVIN-lite with LLaMA weights (single GPU):

bash ./scripts/finetuning_sqa_vicuna_13b_lite.sh

MultiModal ChatBot

Fine-tune LaVIN-13B on 210k instruction-following data (~75 hours with 15 epochs and ~25 hours with 5 epochs on 8x A100 (80G)).

LLaMA weights:

bash ./scripts/vl_instruction_tuning_13b.sh

Vicuna weights:

bash ./scripts/vl_instruction_tuning_vicuna_13b.sh

To train on fewer GPUs, reduce the number of GPUs in the scripts and increase gradient accumulation via --accum_iter to keep the total batch size at 32. Setting --gradient_checkpointing and --bits 4bit in the scripts greatly reduces GPU memory requirements.
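
The batch-size bookkeeping above can be sketched as follows (the per-GPU batch size of 4 is an assumption for illustration; check the actual value in the scripts, and note that `accum_iter_for` is a hypothetical helper, not a repo API):

```python
# With fewer GPUs, raise --accum_iter so that
# n_gpus * per_gpu_batch * accum_iter stays at the target total batch size.
def accum_iter_for(total_batch: int, n_gpus: int, per_gpu_batch: int) -> int:
    """Gradient-accumulation steps needed to reach total_batch."""
    eff = n_gpus * per_gpu_batch
    if total_batch % eff != 0:
        raise ValueError("total batch must be divisible by n_gpus * per_gpu_batch")
    return total_batch // eff

# e.g. assuming the 8x A100 recipe uses a per-GPU batch of 4 (total 32),
# the same per-GPU batch on 2 GPUs needs --accum_iter 4:
print(accum_iter_for(32, n_gpus=2, per_gpu_batch=4))  # -> 4
```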

Demo

LaVIN supports both single- and multi-modal instruction inputs. Try your custom instructions in our demo:

  • Launch a Gradio web server on your machine; you can then interact with LaVIN as you like.
torchrun --nproc_per_node 1 demo.py --server_name 127.0.0.1

Model Zoo

ScienceQA

Model           Base Weights  Time                   Memory  #Params  Acc    Download
LaVIN-7B-lite   LLaMA         29 hours (single GPU)  9G      3.8M     88.35  google drive
LaVIN-13B-lite  LLaMA         42 hours (single GPU)  14G     5.4M     89.44  google drive
LaVIN-7B        LLaMA         1.4 hours              33.9G   3.8M     89.37  google drive
LaVIN-7B        Vicuna        1.4 hours              33.9G   3.8M     89.41  google drive
LaVIN-13B       LLaMA         2 hours                55.9G   5.4M     90.54  google drive
LaVIN-13B       LLaMA         4 hours                55.9G   5.4M     90.8   -

Multimodal ChatBot

Model      Base Weights  Time      Memory  #Params  Acc  Download
LaVIN-13B  LLaMA         25 hours  55.9G   5.4M     -    -
LaVIN-13B  LLaMA         75 hours  55.9G   5.4M     -    google drive

Examples


Citation

If you find our code and paper helpful, please cite LaVIN and RepAdapter:

@article{luo2023towards,
  title={Towards Efficient Visual Adaption via Structural Re-parameterization},
  author={Luo, Gen and Huang, Minglang and Zhou, Yiyi and Sun, Xiaoshuai and Jiang, Guangnan and Wang, Zhiyu and Ji, Rongrong},
  journal={arXiv preprint arXiv:2302.08106},
  year={2023}
}

@article{luo2023cheap,
  title={Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models},
  author={Luo, Gen and Zhou, Yiyi and Ren, Tianhe and Chen, Shengxin and Sun, Xiaoshuai and Ji, Rongrong},
  journal={arXiv preprint arXiv:2305.15023},
  year={2023}
}

Acknowledgement

This repo borrows some data and code from LLaMA, Stanford Alpaca, LLaVA, MiniGPT-4, and LLaMA-Adapter. Thanks for their great work.

Contributors

davidnvq, luogen1996, rentainhe
