Implementation for the report "CLIP Prefix for Video Caption Generation"
Contrastive models like CLIP have demonstrated an impressive ability to learn robust, high-quality visual representations and have sparked many promising application directions. In this work, we leverage the visual embeddings produced by CLIP to tackle the problem of video caption generation. Video captioning is a fundamental task for vision-language understanding, in which the model is asked to generate a text description for an input video clip. The task is challenging because it draws on both video understanding and natural language generation. We therefore combine the high-quality visual features produced by CLIP with a pre-trained language generation model, GPT-2, to create a simple and lightweight model for video caption generation. In our model, representations of video frames encoded by CLIP are transformed into the prefix of a sentence and passed to the language model, which generates the corresponding caption. Experiments on a public video captioning dataset show promising results for our simple method.
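The prefix idea described above can be illustrated with a toy, dependency-free sketch. It is not the code in this repository: a single linear projection stands in for the mapping network, the dimensions are tiny placeholders, and the names `project` and `build_lm_inputs` are hypothetical.

```python
import random

CLIP_DIM = 4  # toy size; real CLIP embeddings are much larger
LM_DIM = 3    # toy size standing in for the GPT-2 embedding width


def project(feature, weight):
    """Map one frame feature (CLIP_DIM values) to one prefix vector (LM_DIM values)."""
    return [sum(f * w for f, w in zip(feature, column)) for column in weight]


def build_lm_inputs(frame_features, weight, caption_embeddings):
    """Turn frame features into prefix vectors and prepend them to the caption embeddings."""
    prefix = [project(feature, weight) for feature in frame_features]
    return prefix + caption_embeddings


rng = random.Random(0)
# The weight is stored as LM_DIM columns, each holding CLIP_DIM values.
weight = [[rng.uniform(-1, 1) for _ in range(CLIP_DIM)] for _ in range(LM_DIM)]
# Five fake frame features in place of real CLIP outputs.
frames = [[rng.uniform(-1, 1) for _ in range(CLIP_DIM)] for _ in range(5)]
# Two placeholder token embeddings in place of a real caption.
caption = [[0.0] * LM_DIM for _ in range(2)]

lm_inputs = build_lm_inputs(frames, weight, caption)
print(len(lm_inputs), len(lm_inputs[0]))
```

The language model then sees one sequence in which the visual prefix comes first and the caption tokens follow, so the caption is generated conditioned on the video.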
Example captions generated by the model:
a girl is talking about how to make a mask | a man is driving a car in a car and | a band is performing a song on stage and a
Clone, create environment and install dependencies:
git clone https://github.com/juexZZ/NYUFall22-CVProject-CLIPVideoCap.git && cd NYUFall22-CVProject-CLIPVideoCap
conda env create -f environment.yml
conda activate clip_prefix_caption
Download video dataset
Extract CLIP features
Run 'CLIP_feature_extraction.ipynb'
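The notebook itself is not reproduced here. One common preprocessing step before encoding a clip is picking a fixed number of frames from it; the sketch below assumes uniform sampling and may differ from what the notebook actually does. The function name `sample_frame_indices` and the frame counts in the example are hypothetical.

```python
def sample_frame_indices(num_frames, num_samples):
    """Return up to num_samples frame indices spread evenly over [0, num_frames)."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= num_frames:
        # Short clip: keep every frame.
        return list(range(num_frames))
    step = num_frames / num_samples
    # Take the frame at the middle of each of the num_samples equal segments.
    return [int(step * i + step / 2) for i in range(num_samples)]


# Illustrative values only: a 300-frame clip reduced to 8 frames.
indices = sample_frame_indices(300, 8)
print(indices)
```

Each selected frame would then be passed through the CLIP image encoder, and the resulting embeddings saved for training.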
Train only the feature transformation module:
python train_vtt.py --mapping_type transformer --num_layers 8 --prefix_length_clip 28 --bs 40 --only_prefix --save_every 10 --epochs 10 \
--cross --out_dir cross_length20 --prefix_length 20
To also fine-tune GPT-2, drop the --only_prefix flag:
python train_vtt.py --mapping_type transformer --num_layers 8 --prefix_length_clip 28 --bs 40 --save_every 10 --epochs 10 \
--cross --out_dir cross_length20 --prefix_length 20
To do inference with a trained model:
python inference_vtt.py --model_dir cross_length20 --prefix_length 20 --mapping_type transformer --cross --entry_length 10 \
--num_layers 8 --epoch 9
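As a rough illustration of caption generation with a length cap, here is a toy greedy decoder. It assumes that `--entry_length` limits the number of generated tokens, as the name suggests; `inference_vtt.py` may decode differently. The names `greedy_decode` and `toy_scores` and the scripted sentence are hypothetical stand-ins for a real language model.

```python
def greedy_decode(next_token_scores, prefix, max_tokens, stop_token="."):
    """Pick the highest-scoring token at each step until the stop token or the cap."""
    tokens = []
    for _ in range(max_tokens):
        scores = next_token_scores(prefix, tokens)
        best = max(scores, key=scores.get)
        if best == stop_token:
            break
        tokens.append(best)
    return " ".join(tokens)


# Toy "model": follows a fixed script and ignores the prefix entirely.
SCRIPT = "a dog runs across the field .".split()


def toy_scores(prefix, tokens):
    nxt = SCRIPT[len(tokens)] if len(tokens) < len(SCRIPT) else "."
    return {nxt: 1.0, "<unk>": 0.1}


print(greedy_decode(toy_scores, prefix=None, max_tokens=3))
print(greedy_decode(toy_scores, prefix=None, max_tokens=20))
```

With a small cap, decoding stops before the stop token is reached, so the output can end mid-sentence.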
This repository is heavily based on the CLIP, CLIPCap, and Hugging Face repositories. For training, we used the MSR-VTT dataset.