🤖 VLM-RLAIF (ACL'24 Oral)

Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback,
Daechul Ahn^1,3, Yura Choi^1,3, Youngjae Yu¹, Dongyeop Kang², Jonghyun Choi^3,†
¹Yonsei University, ²University of Minnesota, ³Seoul National University
^†Corresponding Author
ACL 2024 (To appear)

📣 News

[Aug 07, 2024] We update our trained lora checkpoint of reward model & policy model initialization to Hugginface
[Aug 06, 2024] Our model is available in HuggingFace Spaces!
[Jul 16, 2024] 🎙️ VLM-RLAIF has been selected for ✨oral presentation✨ at ACL 2024! See you in Bangkok 🇹🇭
[Jun 16, 2024] 🔥 Our next work on aligning large video multimodal model, i-SRT🚄, is now available [arXiv, code]
[May 31, 2024] 🥳 VLM-RLAIF is accepted to ACL 2024 !

👀 Overview

Abstract: Recent advancements in large language models have influenced the development of video large multimodal models (VLMMs). Previous approaches for VLMMs involve Supervised Fine-Tuning (SFT) with instruction-tuned datasets, integrating LLM with visual encoders, and additional learnable parameters. Here, aligning video with text, and vice versa, remains a challenge, primarily due to the insufficient quality and quantity of multimodal instruction-tune data compared to that of text-only. This discrepancy often results in alignments that poorly ground the video content. To address this, we present a novel alignment strategy that employs a multimodal AI system equipped with Reinforcement Learning from AI Feedback (RLAIF), providing self-preference feedback to refine itself and facilitating the alignment of video and text modalities. Our approach uniquely integrates detailed video descriptions as context into a multimodal AI system during preference feedback generation to enrich the understanding of video content, a process we call context-aware reward modeling. Empirical evaluations on various video benchmarks demonstrate that our VLM-RLAIF outperforms existing approaches, including the SFT model.

Pipeline of VLM-RLAIF

🗃️ Dataset and Checkpoints

Check PREPARE_DATASET.md to prepare training & validation datasets

Model	Size	Checkpoint	corr.	detail.	context	temp.	const.
RLAIF	7B	SNUMPR/vlm_rlaif_video_llava_7b	3.63	3.25	4.00	3.23	3.32
SFT	7B	SNUMPR/vlm_sft_video_llava_7b	2.79	2.82	3.37	2.28	2.49

Lora Checkpoints (used to train the model w/ PPO)

Model	Size	Lora Checkpoint
Policy init	7B	SNUMPR/vlm_policy_init_7b_lora
Reward model	13B	SNUMPR/vlm_rm_13b_lora

Dataset Usage	Link
SFT (short)	SNUMPR/vlm_rlaif_datasets/SFT_short.json
SFT (long)	SNUMPR/vlm_rlaif_datasets/SFT_long.json
Preference dataset (for RM)	SNUMPR/vlm_rlaif_datasets/RM_13b_v1_dataset_39k.json
PPO init	SNUMPR/vlm_rlaif_datasets/PPO_init.json
RLAIF	SNUMPR/vlm_rlaif_datasets/RL_data.json

📊 Evaluation

Check PREPARE_DATASET.md to prepare training & validation datasets

Zero-shot QA

bash Evaluation/zeroshotqa/scripts/zeroshotqa_pipeline.sh

Video Generative Benchmark

bash Evaluation/scripts/videochatgpt_pipeline.sh

💻 Training w/ RLAIF

Refer to the RLAIF folder to train reward model, policy model, and do PPO

🔧 Data Generation

Available Soon

📚 Citation

@inproceedings{ahnCYKC24,
      author    = {Daechul Ahn and Yura Choi and Youngjae Yu and Dongyeop Kang and Jonghyun Choi},
      title     = {Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback},
      booktitle = {ACL},
      year      = {2024}
}

License

The majority of this project is released under the Apache 2.0 license as found in the LICENSE file.
The service is a research preview intended for non-commercial use only, subject to the model License of LLaMA

yonseivnl / vlm-rlaif Goto Github PK

vlm-rlaif's Introduction

🤖 VLM-RLAIF (ACL'24 Oral)

📣 News

👀 Overview

🗃️ Dataset and Checkpoints

📊 Evaluation

💻 Training w/ RLAIF

🔧 Data Generation

📚 Citation

License

Acknowledgement

vlm-rlaif's People

Contributors

Stargazers

Watchers

Forkers

vlm-rlaif's Issues

Recommend Projects

Recommend Topics

Recommend Org