CLIP prefix for video captioning.

Implementation for the report "CLIP Prefix for Video Caption Generation".

Description

Contrastive models like CLIP have demonstrated an impressive ability to learn robust, high-quality visual representations and have sparked many promising application directions. In this work, we leverage the visual embeddings produced by CLIP to tackle video caption generation. Video captioning is a fundamental task for vision-language understanding, in which the model is asked to generate a text description for an input video clip. The task is challenging because it requires both video understanding and natural language generation. We therefore combine the high-quality visual features produced by CLIP with a pre-trained language generation model, GPT-2, to create a simple and lightweight model for video caption generation. In our model, the CLIP-encoded representations of video frames are transformed into a prefix for the sentence and passed to the language model, which generates the corresponding caption. Experiments on a public video captioning dataset show promising results for this simple method.
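To make the prefix idea concrete, the sketch below works out the sequence layout when a prefix of mapped CLIP embeddings is placed in front of the caption tokens. It assumes the conventions of a causal language model such as GPT-2, where the output at position i predicts the input at position i + 1, and it assumes that only caption tokens contribute to the training loss. The helper `prefix_layout` is hypothetical and is not taken from this repository.

```python
from typing import Dict, List


def prefix_layout(prefix_length: int, num_caption_tokens: int) -> Dict[str, List[int]]:
    """Describe where the prefix and the caption sit in the language model input.

    Hypothetical helper for illustration only; not part of the repository.
    """
    if prefix_length < 1 or num_caption_tokens < 1:
        raise ValueError("both lengths must be at least 1")
    total = prefix_length + num_caption_tokens
    return {
        # Input positions occupied by the mapped CLIP embeddings.
        "prefix_positions": list(range(prefix_length)),
        # Input positions occupied by the caption token embeddings.
        "caption_positions": list(range(prefix_length, total)),
        # Output positions whose predictions are compared with caption tokens:
        # the last prefix position predicts the first caption token, and so on.
        "predicting_positions": list(range(prefix_length - 1, total - 1)),
    }


layout = prefix_layout(prefix_length=20, num_caption_tokens=5)
print(layout["caption_positions"])     # [20, 21, 22, 23, 24]
print(layout["predicting_positions"])  # [19, 20, 21, 22, 23]
```

Under these assumptions, a prefix length of 20 means the first 20 input positions carry video information, and the caption loss is computed on the outputs from position 19 onward.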

Demos for our Video Captioning


Example generated captions:

a girl is talking about how to make a mask

a man is driving a car in a car and

a band is performing a song on stage and a

Training prerequisites

Clone, create environment and install dependencies:

git clone https://github.com/juexZZ/NYUFall22-CVProject-CLIPVideoCap.git && cd NYUFall22-CVProject-CLIPVideoCap
conda env create -f environment.yml
conda activate clip_prefix_caption

MSR_VTT training

Download video dataset

Extract CLIP features

Run 'CLIP_feature_extraction.ipynb'
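The notebook itself is not reproduced here. As a rough illustration of one step that video feature extraction commonly involves, the sketch below picks evenly spaced frame indices so that every clip yields a fixed number of frames for CLIP to encode. Whether the notebook samples frames this way, and whether the frame count corresponds to `--prefix_length_clip 28`, are assumptions. `sample_frame_indices` is a hypothetical helper, not a function from the repository.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Return up to num_samples evenly spaced frame indices in [0, total_frames).

    Hypothetical helper for illustration only; not part of the repository.
    """
    if total_frames < 1 or num_samples < 1:
        raise ValueError("total_frames and num_samples must be at least 1")
    if num_samples >= total_frames:
        # Short clip: keep every frame rather than repeating any.
        return list(range(total_frames))
    # Split the clip into num_samples equal segments and take each segment's centre.
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


indices = sample_frame_indices(total_frames=280, num_samples=28)
print(len(indices), indices[:3])  # 28 [5, 15, 25]
```

Each selected frame would then be preprocessed and encoded by CLIP, giving one embedding per frame.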

Train only the feature transformation module

python train_vtt.py --mapping_type transformer --num_layers 8 --prefix_length_clip 28 --bs 40 --only_prefix --save_every 10 --epochs 10 \
--cross --out_dir cross_length20 --prefix_length 20

Fine-tune GPT-2 as well (omit --only_prefix)

python train_vtt.py --mapping_type transformer --num_layers 8 --prefix_length_clip 28 --bs 40 --save_every 10 --epochs 10 \
--cross --out_dir cross_length20 --prefix_length 20

Run inference with a trained model

python inference_vtt.py --model_dir cross_length20 --prefix_length 20 --mapping_type transformer --cross --entry_length 10 \
--num_layers 8 --epoch 9
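The `--entry_length` flag appears to bound the length of the generated caption. Assuming it caps the number of generated tokens, the sketch below shows a greedy decoding loop that stops either at a stop token or once that cap is reached. The functions `greedy_decode` and `toy_next_token`, and the toy sentence, are invented for illustration and stand in for the real model and tokenizer.

```python
from typing import Callable, List


def greedy_decode(
    next_token: Callable[[List[str]], str],
    entry_length: int,
    stop_token: str = ".",
) -> List[str]:
    """Generate at most entry_length tokens, stopping early at stop_token.

    Hypothetical helper for illustration only; not part of the repository.
    """
    tokens: List[str] = []
    for _ in range(entry_length):
        token = next_token(tokens)
        if token == stop_token:
            break
        tokens.append(token)
    return tokens


# Toy stand-in for a language model: always continues one fixed sentence.
TOY_SENTENCE = "a dog runs across the field and jumps over a fence .".split()


def toy_next_token(generated: List[str]) -> str:
    position = len(generated)
    return TOY_SENTENCE[position] if position < len(TOY_SENTENCE) else "."


print(" ".join(greedy_decode(toy_next_token, entry_length=5)))
# a dog runs across the
print(" ".join(greedy_decode(toy_next_token, entry_length=20)))
# a dog runs across the field and jumps over a fence
```

Under this assumption, a small cap cuts a caption off mid-sentence, while a larger one lets generation run until the stop token.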

Acknowledgments

This repository is heavily based on the CLIP, CLIPCap, and Hugging Face repositories. For training, we used data from the MSR_VTT dataset.
