
lezhang7 / enhance-finegrained

[CVPR' 2024] Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Fine-grained Understanding

License: Other


enhance-finegrained's Introduction

🎉 The paper was accepted to CVPR 2024!

TL;DR: We propose two losses over our generated hard negative examples to enhance CLIP's compositional understanding ability.

[Figure: motivation overview]
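
For intuition, hard negatives are small, compositionally meaningful edits of the ground-truth caption. The examples below are purely illustrative and may not match the exact generation rules used in this repo:

# Purely illustrative; the repo's actual negative-generation rules may differ.
caption = "a black dog sitting on a white couch"
hard_negatives = [
    "a white dog sitting on a black couch",   # attribute swap
    "a black couch sitting on a white dog",   # object swap
    "a black dog standing on a white couch",  # relation change
]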

This repo is forked from the wonderful OpenCLIP; for model and training details, please refer to the original repo.

β˜‘οΈ Checkpoints

The checkpoints can be downloaded directly with gdown using the following script:

pip install --upgrade --no-cache-dir gdown # must update gdown to avoid bugs, thanks to https://github.com/wkentaro/gdown/issues/146
gdown 1DWPw3CtGh5cHz9bW_-iXRSG7BBUVl13K #download checkpoint for CE-CLIP
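
Once downloaded, the checkpoint should be loadable with a recent version of open_clip. The snippet below is a minimal sketch; ce_clip_checkpoint.pt is a placeholder for whatever filename gdown saves, and if the file is a full training state dict you may need to extract the model weights first.

# Minimal sketch: loading the downloaded checkpoint with OpenCLIP.
# "ce_clip_checkpoint.pt" is a placeholder for the file gdown saved.
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="ce_clip_checkpoint.pt"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()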

Training

1. Generating the training dataset

The training data is generated from COCO 2014, so you can either download COCO yourself and set the coco dataset path in dataset.py, or simply run the following script to download and generate the dataset:

cd data/
bash prepare_dataset.sh
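
If you already have COCO 2014 locally, point the dataset code at it instead. The variable names below are only illustrative; check data/dataset.py for the actual ones:

# Illustrative only; the actual variable names in data/dataset.py may differ.
COCO_ROOT = "/path/to/coco2014"            # should contain train2014/ and annotations/
CAPTIONS_JSON = f"{COCO_ROOT}/annotations/captions_train2014.json"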

2. Training

You need to specify training parameters, such as --gres=gpu:a100:2 and batch_size, in scripts/run_all.sh; please refer to that script for more details. To run the training, simply use the following script:

cd scripts/
bash run_multiple_nodes.sh

The resulting checkpoint will be saved at Enhance-FineGrained/src/Outputs.

Evaluation

We evaluate our method on four downstream benchmarks: ARO, VALSE, VL-CheckList, and the very recent SugarCrepe, and we provide evaluation code for each. However, you need to visit the official GitHub pages to download the datasets before evaluating on them.

ARO&VALSE

ARO

Evaluation code for ARO is included in Enhance-FineGrained/vision-language-models-are-bows. To reproduce the results, you need to:

  1. Set up the environment by running bash Enhance-FineGrained/vision-language-models-are-bows/scripts/create_environment.sh

  2. cd Enhance-FineGrained/vision-language-models-are-bows/scripts and change the checkpoint path in reproduce_aro.sh, then run the script to reproduce the results. Note that the dataset will be downloaded automatically.


  3. Evaluation code for VALSE is included in Enhance-FineGrained/VALSE. To reproduce results on VALSE, please download the dataset here first, then replace the dataset path in Enhance-FineGrained/VALSE/clip_valse_eval.py and Enhance-FineGrained/VALSE/xvlm_valse_eval.py.

  4. Replace $checkpoint in the Enhance-FineGrained/VALSE/scripts, then run the scripts; evaluation results will be written to Enhance-FineGrained/VALSE/output.

VL-CheckList [Not Suggested]

[Figure: VL-CheckList results]

❗ Note: the original dataset is incomplete; we encourage skipping this benchmark.

Please refer to the official GitHub repo to download the dataset and perform evaluation. Note that downloading the dataset can be quite cumbersome.

We provide an evaluation script here.

🌟 SugarCrepe

[Figure: SugarCrepe results]

SugarCrepe is a benchmark for faithful vision-language compositionality evaluation. It fixes several biases in the benchmarks above that rendered them hackable, allowing blind models with no access to the image to outperform state-of-the-art vision-language models.

To evaluate on this dataset, simply clone their repo, follow their installation setup, and point --pretrained to our checkpoint:

python main_eval.py --model ViT-B-32 --pretrained Enhance-FineGrained/clip/epoch_5.pt \
    --output ./output \
    --coco_image_root ./data/coco/images/val2017/ \
    --data_root ./data/

Ablations

Our method entails curriculum learning, which is validated by the growth of the adaptive threshold during training.

[Figure: ablation, growth of the adaptive threshold]
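
As a rough illustration only (not the exact loss from the paper), a cross-modal ranking loss whose threshold tracks the gap between positive and hard-negative similarities will grow its threshold as the model improves, which is the curriculum effect shown above:

import torch
import torch.nn.functional as F

# Illustrative sketch only, not the paper's exact CMR formulation.
# pos_sim: [B] image <-> ground-truth caption similarities
# neg_sim: [B, K] image <-> hard-negative caption similarities
def ranking_loss_adaptive(pos_sim, neg_sim, threshold):
    loss = F.relu(threshold + neg_sim - pos_sim[:, None]).mean()
    with torch.no_grad():
        # one possible update rule (assumption): running mean of the pos-neg gap
        threshold = 0.99 * threshold + 0.01 * (pos_sim[:, None] - neg_sim).mean()
    return loss, threshold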

πŸ“Ž Citation

@article{zhang2023contrasting,
  title={Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Fine-grained Understanding},
  author={Zhang, Le and Awal, Rabiul and Agrawal, Aishwarya},
  journal={arXiv preprint arXiv:2306.08832},
  year={2023}
}

πŸ“§ Contact

Please let us know if you have further questions or comments; reach out to [email protected]

enhance-finegrained's People

Contributors

lezhang7, rabiulcste


enhance-finegrained's Issues

Interpretation of the intra-modal contrastive (IMC) loss

Dear authors,
Thank you for sharing the codes and checkpoint from your wonderful CVPR work!
The motivation for treating hard negative texts, which finally derives IMC and CMR loss terms, was really cool to me!

However, while reading the paper, the formulation of the IMC loss was confusing for me.
To my understanding, its motivation is to make a further distinction between the (original) caption and hard negative captions in the text embedding space, which is clear to me, but I cannot connect this to the exact design of equation (3) in the paper.
Is there any reference or intuition behind it?

I also checked its implementation, which does not seem to align with the equation.
It rather tries to maximize the similarity between the (original) caption and the first hard negative caption by the contrastive loss, if I am not mistaken.

### src/training/data.py
class HardNegative_Collate:

    def __call__(self, batch):
        img = torch.stack([example[0] for example in batch])
        ture_caption = torch.stack([example[1][0] for example in batch])
        hard_negative = torch.cat([example[1][1:] for example in batch])
        text = torch.cat([ture_caption, hard_negative])
        valid_caption_mask = torch.stack([example[2] for example in batch])
        return img, text, valid_caption_mask

### src/open_clip/loss.py
# gt_text_features = text_features[:image_features.shape[0]]
# da_text_features = text_features[image_features.shape[0]:]  <- contains only hard negatives?
# embedding_matrix = logit_scale * gt_text_features @ da_text_features.T  #(batch_size,4*batch_size)
def get_imc_loss(self, gt_logits_per_image: torch.Tensor, embedding_matrix: torch.Tensor):
    """
    gt_logits_per_image: standard clip similarity matrix, diag is true gt similarity value : shape [batch_size,5xbatch_size]
    embedding_matrix: extra similarity matrix served as denominator in clip loss
    """

    logtis_matrix = embedding_matrix
    labels = torch.zeros(logtis_matrix.shape[0], device=logtis_matrix.device, dtype=torch.long)
    imc_loss = F.cross_entropy(logtis_matrix, labels)
    return imc_loss

In my humble opinion, prepending an additional column corresponding to the positive caption representation to the embedding_matrix is necessary for the contrastive loss to work desirably.
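
For concreteness, here is a minimal sketch of what I mean; I am assuming the diagonal of gt_logits_per_image as the positive logit, which may of course not be what equation (3) intends:

import torch
import torch.nn.functional as F

# Sketch of the suggestion above, not the repo's code: prepend a positive
# column so that label 0 actually indexes a positive pair.
def get_imc_loss_with_positive(gt_logits_per_image, embedding_matrix):
    pos_logit = torch.diagonal(gt_logits_per_image)[:, None]   # [batch_size, 1]
    logits = torch.cat([pos_logit, embedding_matrix], dim=1)   # [batch_size, 1 + num_negatives]
    labels = torch.zeros(logits.shape[0], device=logits.device, dtype=torch.long)
    return F.cross_entropy(logits, labels)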

Could you share your insights or point out anything that I'm missing?

Best regards,

About the performance on ELEVATER

Hello, sorry to bother you again. I notice that CE-CLIP's performance on ELEVATER reported in the paper is 53.2; however, in my case it is 44.4 using your provided checkpoint. Since this is a huge gap, I don't know whether my code is wrong or there is some other reason. Would you mind sharing your test code for ELEVATER? Thanks very much!

Cannot reproduce the results on VL-Checklist

Hello, thanks for your wonderful work! I cannot reproduce the results on VL-CheckList using your provided checkpoint.
The fine-grained results I get are as follows:

VL_checklist Relation: action (0.7833), spatial (0.7238), average (0.7535) 
VL_checklist Attribute: action (0.7733), color (0.7729), material (0.8091), size (0.6823), state (0.6828), average (0.7441)
VL_checklist Object: location (0.8121), size (0.7984), average (0.8053)

Attention visualization

We need some attention visualization to show how attention gets better with our proposed method.

  • Write a function for visualizing attention on a given image (a rough starting point is sketched below)
  • Figure out which technique we should use for attention maps
  • Sample a few random images and plot attention to them.
  • @Magiccircuit will share a paper that he already considered for this task

Some motivation from here https://arxiv.org/pdf/2012.09838.pdf
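
As a rough starting point (my assumption, using plain attention rollout rather than the relevancy-propagation method of the paper above):

import torch

# Attention rollout (Abnar & Zuidema, 2020) as a simple visualization baseline.
# attentions: list of per-layer attention tensors, each of shape [heads, tokens, tokens].
def attention_rollout(attentions):
    tokens = attentions[0].shape[-1]
    rollout = torch.eye(tokens)
    for attn in attentions:
        attn = attn.mean(dim=0)                       # average over heads
        attn = attn + torch.eye(tokens)               # account for residual connections
        attn = attn / attn.sum(dim=-1, keepdim=True)  # re-normalize rows
        rollout = attn @ rollout
    return rollout[0, 1:]   # CLS-token attention over image patches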

training epoch

I'm curious about the number of epochs your training process involved. According to the paper, it states that the model was trained for 5 epochs; however, in your training script, it appears that the training was actually performed for 10 epochs. This discrepancy has left me somewhat puzzled. I would greatly appreciate it if you could take a moment to explain this. Thank you!

Onboarding for @rabiulcste: Running Evaluations on the VL Checklist, Plus Knowledge Sharing

This issue is intended for the onboarding process of @rabiulcste. The goal of this onboarding is to get Rabiul up to speed with the project and enable him to contribute effectively. As part of this onboarding process, Rabiul will need to run evaluations on the VL checklist. The following steps outline what needs to be done:

  1. The first step for Rabiul is to read the paper, which is a CVPR workshop submission, to gain an understanding of the project. This will help him familiarize himself with the overall objectives and methodology of the project.
  2. As recommended by @Magiccircuit, Rabiul's main task, to begin with, is to run the necessary evaluations on the VL checklist. This is a crucial step in ensuring the accuracy and efficacy of the project's algorithms. The VL checklist can be found at this link: https://arxiv.org/abs/2207.00221
  3. To facilitate Rabiul's evaluation process, Le will share all the checkpoints required to run the codebase. These checkpoints can be found in the /networks/project/aishwaryalab directory.
