# Vision&Language Finetuning

## Prompt Learning for Vision-Language Models

This repo contains the code for adapting vision-language models such as CLIP to downstream datasets. The arXiv preprint is at:
## How to Install

We recommend installing the environment through conda and pip. First, create a new environment with python>=3.9, for example:

```bash
conda create -n vl_finetuning python=3.9
```

Next, install PyTorch from the official site, for example:

```bash
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
```

Then run the following in this repo to install a few more packages required by CLIP (you do not need to install dassl):

```bash
pip install -r requirements.txt
```
Follow DATASETS.md to install the downstream datasets. Note that we use the original data splits (including the few-shot splits for seeds 1-3, except for ImageNet) to ensure a fair comparison to CoOp.
## Configuration

Similar to CoOp, we use yacs to specify all experiment configurations for reproducibility. The root configuration file can be found at engine/config/default.py, and you may want to modify the `_C.DATA_DIR` path to point to where you installed the datasets. Note that this config is different from the default one in the CoOp codebase; for simplicity, we removed the configuration options for semi-supervised learning and kept only those relevant to vision&language finetuning.
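As a minimal sketch of how such a yacs root config is typically structured (the field names below beyond `DATA_DIR` are illustrative, not the repo's actual defaults):

```python
from yacs.config import CfgNode as CN

_C = CN()
_C.DATA_DIR = "/path/to/datasets"  # point this at your dataset root
_C.SEED = 1

def get_cfg_default():
    # Return a copy so callers can merge experiment yaml files into it
    # without mutating the global defaults.
    return _C.clone()
```

An experiment script would then call `cfg = get_cfg_default()` followed by `cfg.merge_from_file(...)` to layer a dataset- or experiment-specific yaml file on top of these defaults.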
## Split few-shot train/val sets

We provide few-shot train/val splits for seeds 1, 2, 3 and shots 1, 2, 4, 8, 16 in indices/, as generated from the original CoOp codebase (except for ImageNet). If you want to generate more splits with different shots and seeds, please refer to split_for_few_shot.py. You will need to specify a dataset config yaml file such as engine/config/datasets/imagenet.yaml, and a few-shot config yaml file such as engine/config/few_shot/shot_16.yaml. Then run:

```bash
python split_for_few_shot.py --dataset-config-file config/datasets/imagenet.yaml --few-shot-config-file config/few_shot/shot_1.yaml SEED 1
```
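For intuition, generating such a split amounts to seeded per-class sampling. A minimal stand-alone sketch (not the repo's actual split_for_few_shot.py logic) might look like:

```python
import random
from collections import defaultdict

def split_few_shot(labels, shots, seed):
    """Sample `shots` training indices per class with a fixed seed,
    so the same (seed, shots) pair always yields the same split."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train = []
    for label in sorted(by_class):  # sorted for determinism across runs
        train.extend(rng.sample(by_class[label], shots))
    return sorted(train)
```

Fixing the seed is what makes the seed-1/2/3 splits reproducible across machines.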
## Feature Extraction

You may use features.py to extract image and text features from a frozen CLIP model. The feature-extraction configuration is specified through 4 yaml files (image encoder, text encoder, template, and view), in addition to the dataset and few-shot configs. For example, run:

```bash
python features.py \
--dataset-config-file config/datasets/dtd.yaml \
--few-shot-config-file config/few_shot/shot_1.yaml \
--image-encoder-config-file config/features/image/vitb16_layer_all.yaml \
--text-encoder-config-file config/features/text/layer_0.yaml \
--template-config-file config/features/template/single.yaml \
--view-config-file config/features/view/view_1_ccrop.yaml \
SEED 1
```

Or you can quickly extract features for multiple configuration yaml files via features.sh:

```bash
bash features.sh
```
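Conceptually, feature extraction runs a frozen encoder over the dataset in batches and stacks the outputs. A simplified numpy sketch of that pattern (with `encoder` standing in for CLIP's image or text encoder, which in the real script would run in eval mode under torch.no_grad()):

```python
import numpy as np

def extract_features(encoder, images, batch_size=32):
    """Run a frozen encoder over `images` in batches and stack the results."""
    feats = []
    for start in range(0, len(images), batch_size):
        batch = images[start:start + batch_size]
        feats.append(encoder(batch))
    out = np.concatenate(feats, axis=0)
    # L2-normalize, since CLIP features are compared by cosine similarity.
    return out / np.linalg.norm(out, axis=1, keepdims=True)
```

Caching these features once makes the downstream classifiers below cheap to train, since the expensive CLIP forward pass never needs to be repeated.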
## Training

### Mini-Batch Logistic Regression

```bash
python logreg_minibatch.py \
--dataset-config-file config/datasets/imagenet.yaml \
--few-shot-config-file config/few_shot/shot_16.yaml \
--image-encoder-config-file config/features/image/rn50_layer_0.yaml \
--text-encoder-config-file config/features/text/layer_0.yaml \
--template-config-file config/features/template/single.yaml \
--view-config-file config/features/view/view_1_ccrop.yaml \
--cross-modal-config-file config/cross_modal/text_ratio_0.yaml \
--logit-config-file config/logit/linear.yaml \
--hyperparams-config-file config/hyperparams/logreg_minibatch/adamw.yaml \
--architecture-config-file config/architecture/linear.yaml \
SEED 1
```

Or you can run multiple configurations via logreg_minibatch.sh:

```bash
bash logreg_minibatch.sh
```
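For intuition, mini-batch logistic regression on cached features reduces to softmax regression trained in small batches. A simplified numpy sketch (plain SGD instead of the AdamW optimizer named in the hyperparams config, and without the cross-modal text-feature mixing controlled by the cross-modal config):

```python
import numpy as np

def train_logreg(feats, labels, num_classes, epochs=50, lr=0.1,
                 batch_size=32, seed=0):
    """Mini-batch softmax regression on precomputed features."""
    rng = np.random.default_rng(seed)
    W = np.zeros((feats.shape[1], num_classes))
    b = np.zeros(num_classes)
    n = len(feats)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            x, y = feats[idx], labels[idx]
            logits = x @ W + b
            logits -= logits.max(axis=1, keepdims=True)  # numerical stability
            probs = np.exp(logits)
            probs /= probs.sum(axis=1, keepdims=True)
            probs[np.arange(len(idx)), y] -= 1.0  # softmax cross-entropy gradient
            W -= lr * x.T @ probs / len(idx)
            b -= lr * probs.mean(axis=0)
    return W, b
```

Because the features are frozen and low-dimensional, this linear head trains in seconds even for 16-shot ImageNet.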
## Test feature extraction (for robustness test)

```bash
python test_features.py \
--dataset-config-file config/datasets/dtd_test.yaml \
--image-encoder-config-file config/features/image/rn50_layer_0.yaml \
--view-config-file config/features/view/view_1_ccrop.yaml
```
## AudioCLIP feature extraction for ESC-50 dataset

We follow the instructions in the official AudioCLIP codebase to extract the features. We notice that the AudioCLIP head does not produce good audio features in eval() mode, so we extract the features in train() mode with a batch size of 10. The ESC-50 dataset recommends 5-fold cross validation because the audio samples can be correlated within each of the 5 folds, so we follow this practice and offer 5 train/test splits of ESC-50. For each split, one fold is used as the training set (400 audio samples per fold), and the remaining 4 folds are used for evaluation.
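The fold bookkeeping itself is simple; a sketch of how the 5 train/test splits can be built from ESC-50's per-clip fold labels (one fold trains, the other four evaluate):

```python
def esc50_splits(fold_ids):
    """Given each clip's fold label (1-5), build the 5 splits where one
    fold is the training set and the remaining four folds are the test set."""
    splits = []
    for held in sorted(set(fold_ids)):
        train = [i for i, f in enumerate(fold_ids) if f == held]
        test = [i for i, f in enumerate(fold_ids) if f != held]
        splits.append((train, test))
    return splits
```

Keeping whole folds on one side of the split is what prevents correlated clips (e.g., segments of the same recording) from leaking between train and test.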
## Audio classification with AudioCLIP features

```bash
python audio_classification.py
```
## Citation

If you use this code in your research, please kindly cite the following papers: