deepstack-vl's Introduction

DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs

In this work, we introduce DeepStack, a simple and effective strategy for giving LMMs richer visual information: visual tokens are stacked across transformer layers from bottom to top, while the visual context length stays the same.

⏳ : News

  • [6/16] 🔥 Training and evaluation codes are released.
  • [6/06] 🔥 We released DeepStack. We propose to infuse visual tokens into different transformer layers without increasing the visual context length.

DeepStack LMM

[Figure: DeepStack teaser]

Install

  1. Clone this repository and install packages
git clone git@github.com:MengLcool/DeepStack-VL.git
cd DeepStack-VL
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
  2. Install additional packages for training
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
  3. Install additional packages for lmms-eval evaluation
cd lmms-eval/
pip install -e .
cd ../

pip install git+https://github.com/huggingface/huggingface_hub
huggingface-cli login --token your/hf/tokens

Train

# Coming soon

Evaluation

We provide a script that runs evaluation with lmms-eval. You can set the eval_tasks environment variable to specify the evaluation tasks.

# specify evaluation tasks
export eval_tasks=textvqa,chartqa,docvqa

# for ckpts with vicuna as LLM
bash scripts/eval_lmms.sh $CKPT vicuna_v1

# for ckpts with phi-3 as LLM
bash scripts/eval_lmms.sh $CKPT phi3_instruct
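When several checkpoints need to be evaluated, the same invocation can be assembled from Python. The helper below is only a sketch: `build_eval_command` and its parameter names are hypothetical and not part of this repository, and `path/to/ckpt` is a placeholder. It relies solely on what the commands above show, namely the `eval_tasks` environment variable and the two positional arguments of `scripts/eval_lmms.sh`.

```python
import os


def build_eval_command(ckpt, llm_tag, tasks):
    """Return (env, argv) mirroring the shell invocation shown above.

    ckpt    : checkpoint path, passed as the first positional argument
    llm_tag : second positional argument, e.g. "vicuna_v1" or "phi3_instruct"
    tasks   : iterable of task names, joined with commas into eval_tasks
    """
    # Copy the current environment and add the task list to it.
    env = dict(os.environ, eval_tasks=",".join(tasks))
    argv = ["bash", "scripts/eval_lmms.sh", ckpt, llm_tag]
    return env, argv


env, argv = build_eval_command(
    "path/to/ckpt", "vicuna_v1", ["textvqa", "chartqa", "docvqa"]
)
```

To launch the run, pass the result to `subprocess.run(argv, env=env, check=True)` from the repository root.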

Architecture

[Figure: DeepStack architecture]

The framework of DeepStack is quite simple: the main innovation lies in the DeepStack strategy that infuses visual tokens into different layers.

DeepStack-L: DeepStack for LLMs. Given an input image, we feed the tokens extracted from the low-resolution version to the input layer of the LLM. Considering the 2D nature of images, we extract the neighbors from the high-resolution version and reorganize them into DeepStack, which is then fed to the subsequent layers of the LLM.
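One way to picture the neighbor extraction is as strided 2D sampling of the high-resolution token grid. The sketch below is an illustration under that assumption, not the repository's code: `split_into_stacks` and `scale` are hypothetical names, and each "token" is just its grid coordinate so the grouping is easy to inspect. Every resulting group has the same shape as the low-resolution grid, and position (r, c) in each group comes from the same local neighborhood of the image.

```python
def split_into_stacks(hires, scale):
    """Split a (scale*h) x (scale*w) token grid into scale*scale grids of h x w.

    Grid k = i * scale + j keeps rows i, i+scale, ... and columns
    j, j+scale, ..., so position (r, c) of every grid is drawn from the same
    scale x scale neighborhood of the high-resolution grid.
    """
    rows, cols = len(hires), len(hires[0])
    if rows % scale or cols % scale:
        raise ValueError("grid size must be divisible by scale")
    return [
        [row[j::scale] for row in hires[i::scale]]
        for i in range(scale)
        for j in range(scale)
    ]


# Toy 4x4 high-resolution grid; each "token" is its (row, col) index.
hires = [[(r, c) for c in range(4)] for r in range(4)]
stacks = split_into_stacks(hires, scale=2)

# The first position of each group together covers the top-left 2x2 block.
top_left_neighbors = [stack[0][0] for stack in stacks]
```

With `scale=2` the 16 high-resolution tokens become four groups of four, so each group can be aligned one-to-one with a 2x2 low-resolution grid.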

DeepStack-V: DeepStack for ViTs. We apply a similar sampling strategy but feed the visual tokens into the ViT layers of the vision encoder.
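Both variants share the same layer-wise pattern: the sequence enters the first layer with only the low-resolution tokens, and each extra token group is merged into the hidden states at the visual positions of a later layer. The sketch below assumes the merge is an element-wise residual addition and that groups are consumed one per layer starting at `start`; both choices, along with the names `run_with_deepstack`, `visual_slice`, `start`, and `interval`, are assumptions made for illustration. The layers are stand-in callables, so the same loop reads as LLM layers (DeepStack-L) or ViT layers (DeepStack-V).

```python
def run_with_deepstack(layers, hidden, stacks, visual_slice, start=1, interval=1):
    """Run `layers` over `hidden`, adding one token group before selected layers.

    hidden       : list of token vectors (lists of floats); length never changes
    stacks       : list of token groups, each as long as the visual span
    visual_slice : slice selecting the visual token positions in `hidden`
    """
    hidden = list(hidden)   # do not mutate the caller's sequence
    pending = list(stacks)  # groups still waiting to be infused
    for idx, layer in enumerate(layers):
        if pending and idx >= start and (idx - start) % interval == 0:
            group = pending.pop(0)
            span = hidden[visual_slice]
            if len(group) != len(span):
                raise ValueError("token group must match the visual span length")
            # Residual addition at the visual positions only (assumed merge rule).
            hidden[visual_slice] = [
                [h + g for h, g in zip(token, extra)]
                for token, extra in zip(span, group)
            ]
        hidden = layer(hidden)
    return hidden


# Toy run: one text token, two visual tokens, three identity "layers".
layers = [lambda h: h] * 3
hidden = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
stacks = [
    [[0.5, 0.5], [0.5, 0.5]],
    [[0.25, 0.25], [0.25, 0.25]],
]
out = run_with_deepstack(layers, hidden, stacks, visual_slice=slice(1, 3))
```

The sequence length is three tokens before and after the run, which is the point of the design: extra visual detail is added along the depth of the network rather than along the context.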

Visualization

[Figure: visualization example]

Citation

If you find our work helpful for your research, please consider citing the following BibTeX entry.

@misc{meng2024deepstack,
      title={DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs}, 
      author={Meng, Lingchen and Yang, Jianwei and Tian, Rui and Dai, Xiyang and Wu, Zuxuan and Gao, Jianfeng and Jiang, Yu-Gang},
      publisher={arXiv:2406.04334},
      year={2024},
}

deepstack-vl's Issues

Training data

hello, very nice work! Would you like to release your training data?
