Pink: Unveiling The Power of Referential Comprehension for Multi-modal LLMs.

Pink Weights

Data Download

Pretraining Dataset

The pretraining dataset used in this release is the same as in LLaVA: a subset of the CC-3M dataset. Please see the LLaVA repository for a detailed description of the dataset structure and how to download the images.

Instruction Tuning Dataset

The datasets mentioned in the image need to be downloaded manually.

We also provide the converted dataset used in the instruction tuning:

https://huggingface.co/datasets/SY-Xuan/Pink_sft/
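
As a convenience, the converted dataset can also be fetched programmatically. The sketch below is not part of this repository; it assumes the `huggingface_hub` package is installed, and the local directory name `data/Pink_sft` is a placeholder chosen for illustration.

```python
# Sketch: fetch the converted instruction-tuning data from the Hugging Face Hub.
# Assumption: `pip install huggingface_hub` has been run; paths are placeholders.

REPO_ID = "SY-Xuan/Pink_sft"  # taken from the dataset URL above


def build_download_kwargs(local_dir="data/Pink_sft"):
    """Arguments for snapshot_download; repo_type marks this as a dataset repo."""
    return {"repo_id": REPO_ID, "repo_type": "dataset", "local_dir": local_dir}


def download_pink_sft(local_dir="data/Pink_sft"):
    """Download every file in the dataset repo and return the local path."""
    # Imported lazily so the helper above works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**build_download_kwargs(local_dir))
```

Calling `download_pink_sft()` returns the directory that holds the downloaded files. The raw image datasets still have to be obtained separately.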

LLaMA2 Weight Download

Our model is based on Llama-2-7b-chat-hf. You need to download the weights manually.
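
Before training, it can help to confirm that the downloaded weight directory looks complete. The following is a minimal sketch, not a script from this repository; it assumes the weights are stored in the usual Hugging Face layout (a `config.json`, a tokenizer file, and `*.safetensors` or `*.bin` shards), and `check_llama_dir` is a hypothetical helper name.

```python
from pathlib import Path


def check_llama_dir(path):
    """Return a list of problems found in a local Llama-2-7b-chat-hf directory.

    Assumes the standard Hugging Face checkpoint layout; an empty list means
    the basic files were found, not that the weights are valid.
    """
    root = Path(path)
    problems = []
    if not (root / "config.json").is_file():
        problems.append("missing config.json")
    # Weights may be stored as safetensors or as PyTorch .bin shards.
    if not (any(root.glob("*.safetensors")) or any(root.glob("*.bin"))):
        problems.append("no *.safetensors or *.bin weight files")
    # Either the SentencePiece model or the fast-tokenizer JSON is acceptable.
    if not any((root / name).is_file() for name in ("tokenizer.model", "tokenizer.json")):
        problems.append("missing tokenizer.model / tokenizer.json")
    return problems
```

For example, `check_llama_dir("/path/to/llama2")` returns `[]` when all three checks pass.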

Install

  1. Install Package
conda create -n pink python=3.10 -y
conda activate pink
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
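
A quick sanity check after installation might look like the sketch below. It is an illustration only: it assumes the editable install exposes a top-level package named `pink` (inferred from paths such as pink/eval/object365_filter.py), and `check_environment` is a hypothetical helper.

```python
import importlib.util
import sys


def check_environment(package="pink", expected_python=(3, 10)):
    """Report whether the interpreter version and the installed package look right.

    The package name is an assumption based on the repository layout.
    """
    return {
        "python_matches": sys.version_info[:2] == tuple(expected_python),
        "package_importable": importlib.util.find_spec(package) is not None,
    }
```

Run inside the `pink` conda environment, both values should be `True` if the install succeeded.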

Training

Stage 1

Please refer to scripts/stage1.sh.

Stage 2

Please refer to scripts/stage2.sh.

Stage 2 with Object365

Please refer to scripts/stage2_with_object365.sh.

Self-consistent Bootstrapping

We first convert the *.json annotation files of Object365. Please refer to dataset_generation/object365_detection.py.
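
To give a rough idea of what such a conversion involves, here is a sketch that groups detection boxes by image. It is not the logic of dataset_generation/object365_detection.py: it assumes the annotations follow the COCO-style layout (`images`, `annotations`, `categories`, boxes as `[x, y, width, height]` in pixels), and the normalized `[x1, y1, x2, y2]` output format is a choice made for illustration.

```python
from collections import defaultdict


def group_boxes_by_image(coco):
    """Group COCO-style boxes per image as normalized [x1, y1, x2, y2] corners.

    Assumed input layout and output format; the repository script may differ.
    """
    names = {c["id"]: c["name"] for c in coco["categories"]}
    sizes = {im["id"]: (im["width"], im["height"]) for im in coco["images"]}
    per_image = defaultdict(list)
    for ann in coco["annotations"]:
        width, height = sizes[ann["image_id"]]
        x, y, box_w, box_h = ann["bbox"]
        # Convert pixel [x, y, w, h] to corner coordinates scaled to [0, 1].
        box = [
            round(x / width, 3),
            round(y / height, 3),
            round((x + box_w) / width, 3),
            round((y + box_h) / height, 3),
        ]
        per_image[ann["image_id"]].append(
            {"category": names[ann["category_id"]], "bbox": box}
        )
    return dict(per_image)
```

The input would typically come from `json.load` on one of the annotation files.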

Bootstrapping

Please refer to scripts/object365_generate.sh.

Self-consistent

Please refer to pink/eval/object365_filter.py.
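
One way to picture the self-consistency check: a generated description is kept only if grounding it yields a box that overlaps the box it was generated from. The sketch below illustrates that idea under stated assumptions and is not the code in pink/eval/object365_filter.py; the field names `source_box` and `grounded_box` and the 0.5 IoU threshold are hypothetical.

```python
def box_iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def self_consistent_filter(samples, threshold=0.5):
    """Keep samples whose re-grounded box agrees with the source box.

    `source_box`, `grounded_box`, and the threshold are illustrative assumptions.
    """
    return [
        s for s in samples
        if box_iou(s["source_box"], s["grounded_box"]) >= threshold
    ]
```

A stricter threshold keeps fewer but cleaner description-box pairs; the value to use depends on the data.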

Evaluation

Please refer to inference.ipynb and scripts/eval_refcoco.sh.

Demo

To launch a Gradio web demo, use the following command.

python demo.py --checkpoint-path /path/to/pink --llama-path /path/to/llama2

Citation

If you find Pink useful for your research and applications, please cite using this BibTeX:

@article{xuan2023pink,
  title={Pink: Unveiling the power of referential comprehension for multi-modal llms},
  author={Xuan, Shiyu and Guo, Qingpei and Yang, Ming and Zhang, Shiliang},
  journal={arXiv preprint arXiv:2310.00582},
  year={2023}
}

Acknowledgement

This code builds on code from LLaVA and Shikra. Thanks to their authors for these outstanding implementations.

Related Projects

LocLLM: We leverage LLMs for human keypoint localization. LocLLM shows remarkable performance on standard 2D/3D keypoint localization benchmarks. Moreover, incorporating language clues into localization gives LocLLM superior flexibility and generalization in cross-dataset keypoint localization, and even lets it detect novel keypoint types unseen during training.

Ant-Multi-Modal-Framework: This repository contains code for multi-modality learning from the Multimodal Cognition group of Ant Group that has been integrated into AntMMF.
