
Vision&Language Finetuning

Prompt Learning for Vision-Language Models

This repo contains code for adapting vision-language models such as CLIP to downstream datasets. The arXiv preprint is at:

How to Install

We recommend installing the environment through conda and pip. Start by creating a new environment with Python >= 3.9, for example:

conda create -n vl_finetuning python=3.9

Next, install PyTorch from the official site, for example:

conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
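After installation, you can quickly verify that PyTorch was built with CUDA support:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"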

Then run pip install -r requirements.txt in this repo to install the remaining packages required by CLIP. You do not need to install dassl.

Follow DATASETS.md to install the downstream datasets. Note that we use the original data splits (including the few-shot splits for seeds 1-3, except for ImageNet) to ensure a fair comparison with CoOp.

Configuration

Similar to CoOp, we use yacs to specify all experiment configurations for reproducibility. The root configuration file can be found at engine/config/default.py, and you may want to modify the _C.DATA_DIR path to point to where you installed the datasets. Note that this config is different from the default one in the CoOp codebase; for simplicity, we removed the configurations for semi-supervised learning and kept only those for vision&language finetuning.
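For reference, a yacs root config follows the pattern below (a minimal sketch; apart from _C.DATA_DIR, the field names are illustrative and not the actual contents of engine/config/default.py):

from yacs.config import CfgNode as CN

_C = CN()
_C.SEED = 1  # hypothetical field, shown for illustration
_C.DATA_DIR = "/path/to/datasets"  # edit this to point at your dataset root

def get_cfg_default():
    # Return a fresh copy so individual experiments cannot mutate the defaults.
    return _C.clone()

Experiment yaml files (e.g. config/datasets/imagenet.yaml) are then merged on top via cfg.merge_from_file(...), and command-line overrides such as SEED 1 via cfg.merge_from_list(...).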

Split few-shot train/val sets

We provide few-shot train/val splits for seeds 1, 2, 3 and shots 1, 2, 4, 8, 16 in indices/, generated from the original CoOp codebase (except for ImageNet). If you want to generate more splits with different shots and seeds, please refer to split_for_few_shot.py. You will need to specify a dataset config yaml file such as engine/config/datasets/imagenet.yaml, and a few-shot config yaml file such as engine/config/few_shot/shot_16.yaml. Then run:

python split_for_few_shot.py --dataset-config-file config/datasets/imagenet.yaml --few-shot-config-file config/few_shot/shot_1.yaml SEED 1
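At its core, a few-shot split is just per-class index sampling with a fixed seed. A schematic sketch (not the actual split_for_few_shot.py, which also writes a val split to indices/):

import random
from collections import defaultdict

def split_few_shot(labels, shots, seed):
    # labels: one class id per image index in the full training set.
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    # Sample `shots` indices per class; the result is index-based, mirroring
    # the files saved under indices/.
    return sorted(idx for idxs in by_class.values()
                  for idx in rng.sample(idxs, shots))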

Feature Extraction

You may use features.py to extract image and text features from a frozen CLIP model. The feature extraction is specified through several yaml config files (dataset, few-shot, image encoder, text encoder, template, and view). For example, run:

python features.py \
    --dataset-config-file config/datasets/dtd.yaml \
    --few-shot-config-file config/few_shot/shot_1.yaml \
    --image-encoder-config-file config/features/image/vitb16_layer_all.yaml \
    --text-encoder-config-file config/features/text/layer_0.yaml \
    --template-config-file config/features/template/single.yaml \
    --view-config-file config/features/view/view_1_ccrop.yaml \
    SEED 1

Alternatively, you can extract features for multiple configurations at once via features.sh:

bash features.sh
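Under the hood, extracting features from a frozen CLIP model amounts to the following (a minimal sketch using the OpenAI clip package; the actual features.py additionally handles encoder layers, prompt templates, and augmentation views via the yaml configs):

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)
model.eval()  # CLIP stays frozen; we only read out its features

with torch.no_grad():
    # "example.jpg" and the prompt below are placeholders.
    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
    image_features = model.encode_image(image)             # (1, 512) for ViT-B/16
    tokens = clip.tokenize(["a photo of a dog"]).to(device)
    text_features = model.encode_text(tokens)              # (1, 512)

# Features are typically L2-normalized before training a classifier on them.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)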

Training

Mini-Batch Logistic Regression

python logreg_minibatch.py \
    --dataset-config-file config/datasets/imagenet.yaml \
    --few-shot-config-file config/few_shot/shot_16.yaml \
    --image-encoder-config-file config/features/image/rn50_layer_0.yaml \
    --text-encoder-config-file config/features/text/layer_0.yaml \
    --template-config-file config/features/template/single.yaml \
    --view-config-file config/features/view/view_1_ccrop.yaml \
    --cross-modal-config-file config/cross_modal/text_ratio_0.yaml \
    --logit-config-file config/logit/linear.yaml \
    --hyperparams-config-file config/hyperparams/logreg_minibatch/adamw.yaml \
    --architecture-config-file config/architecture/linear.yaml \
    SEED 1
Or run multiple configurations at once via logreg_minibatch.sh:

bash logreg_minibatch.sh
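Conceptually, the mini-batch logistic regression trained here is a single linear layer over the cached features, optimized with AdamW (a schematic sketch; the cross-modal and logit options selected by the configs above are not shown):

import torch
import torch.nn as nn

def train_logreg(features, labels, num_classes, epochs=50, lr=1e-3, bs=32):
    # features: (N, D) precomputed, normalized CLIP features; labels: (N,) ints.
    model = nn.Linear(features.shape[1], num_classes)
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=1e-2)
    loader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(features, labels),
        batch_size=bs, shuffle=True)
    for _ in range(epochs):
        for x, y in loader:
            loss = nn.functional.cross_entropy(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

All hyperparameters above (epochs, lr, batch size, weight decay) are illustrative; the real values come from config/hyperparams/logreg_minibatch/adamw.yaml.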

Test feature extraction (for robustness tests)

python test_features.py \
    --dataset-config-file config/datasets/dtd_test.yaml \
    --image-encoder-config-file config/features/image/rn50_layer_0.yaml \
    --view-config-file config/features/view/view_1_ccrop.yaml

AudioCLIP feature extraction for ESC-50 dataset

We follow the instructions in the official AudioCLIP codebase to extract the features. We notice that the AudioCLIP head does not produce good audio features in eval() mode, so we extract the features in train() mode with a batch size of 10. The ESC-50 dataset recommends 5-fold cross-validation because audio samples can be correlated within each of the 5 folds, so we follow this practice and offer 5 train/test splits of ESC-50. For each split, one fold (400 audio samples) is used as the training set, and the remaining 4 folds are used for evaluation.
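Concretely, each of the 5 splits uses one fold for training (400 clips) and the remaining four for evaluation. A sketch, assuming the fold ids 1-5 from the ESC-50 metadata:

def esc50_splits(fold_labels):
    # fold_labels: one fold id (1..5) per audio clip, from the ESC-50 metadata.
    # Returns 5 (train_indices, test_indices) pairs, one per split.
    splits = []
    for train_fold in range(1, 6):
        train = [i for i, f in enumerate(fold_labels) if f == train_fold]
        test = [i for i, f in enumerate(fold_labels) if f != train_fold]
        splits.append((train, test))  # 400 train clips, 1600 eval clips
    return splits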

Audio classification with AudioCLIP features

python audio_classification.py

Citation

If you use this code in your research, please cite the following papers:
