
lezhang7 / enhance-finegrained

[CVPR' 2024] Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Fine-grained Understanding

License: Other


enhance-finegrained's Introduction

🎉 The paper was accepted to CVPR 2024!

TL;DR: We propose two losses over our generated hard negative examples to enhance CLIP's compositional understanding ability.

[Figure: motivation overview]
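
For intuition, hard negatives are small, compositionally meaningful edits of the ground-truth caption. The examples below are purely illustrative and may not match the exact generation rules used in this repo:

# Purely illustrative; the repo's actual negative-generation rules may differ.
caption = "a black dog sitting on a white couch"
hard_negatives = [
    "a white dog sitting on a black couch",   # attribute swap
    "a black couch sitting on a white dog",   # object swap
    "a black dog standing on a white couch",  # relation change
]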

This repo is forked from the wonderful OpenCLIP; for model and training details, please refer to the original repo.

β˜‘οΈ Checkpoints

The checkpoints can be downloaded directly with gdown using the following script:

pip install --upgrade --no-cache-dir gdown # must update gdown to avoid bugs, thanks to https://github.com/wkentaro/gdown/issues/146
gdown 1DWPw3CtGh5cHz9bW_-iXRSG7BBUVl13K #download checkpoint for CE-CLIP
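
Once downloaded, the checkpoint should be loadable with a recent version of open_clip. The snippet below is a minimal sketch; ce_clip_checkpoint.pt is a placeholder for whatever filename gdown saves, and if the file is a full training state dict you may need to extract the model weights first.

# Minimal sketch: loading the downloaded checkpoint with OpenCLIP.
# "ce_clip_checkpoint.pt" is a placeholder for the file gdown saved.
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="ce_clip_checkpoint.pt"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()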

Training

1. Generating the training dataset

The training data is generated from COCO 2014, so you can either download COCO yourself and set the coco dataset path in dataset.py, or simply run the following script to download and generate the dataset:

cd data/
bash prepare_dataset.sh
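
If you already have COCO 2014 locally, point the dataset code at it instead. The variable names below are only illustrative; check data/dataset.py for the actual ones:

# Illustrative only; the actual variable names in data/dataset.py may differ.
COCO_ROOT = "/path/to/coco2014"            # should contain train2014/ and annotations/
CAPTIONS_JSON = f"{COCO_ROOT}/annotations/captions_train2014.json"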

2. Training

You need to specify training parameters, such as --gres=gpu:a100:2 and batch_size, in scripts/run_all.sh; please refer to that script for more details. To run the training, simply use the following script:

cd scripts/
bash run_multiple_nodes.sh

The resulting checkpoint will be saved at Enhance-FineGrained/src/Outputs.

Evaluation

We evaluate our method on four downstream benchmarks: ARO, VALSE, VL-CheckList, and the very recent SugarCrepe, and we provide evaluation code for each. However, you need to visit the official GitHub pages to download the datasets before evaluating on them.

ARO&VALSE

ARO

Evaluation code for ARO is included in Enhance-FineGrained/vision-language-models-are-bows. To reproduce the results, you need to:

  1. Set up the environment by running bash Enhance-FineGrained/vision-language-models-are-bows/scripts/create_environment.sh

  2. cd Enhance-FineGrained/vision-language-models-are-bows/scripts and change the checkpoint path in reproduce_aro.sh, then run the script to reproduce the results. Note that the dataset will be downloaded automatically.


  3. Evaluation code for VALSE is included in Enhance-FineGrained/VALSE. To reproduce results on VALSE, please download the dataset here first, then replace the dataset path in Enhance-FineGrained/VALSE/clip_valse_eval.py and Enhance-FineGrained/VALSE/xvlm_valse_eval.py.

  4. Replace $checkpoint in the Enhance-FineGrained/VALSE/scripts, then run the scripts; evaluation results will be written to Enhance-FineGrained/VALSE/output.

VL-CheckList [Not Suggested]

[Figure: VL-CheckList results]

❗ Note: the original dataset is incomplete; we encourage skipping this benchmark.

Please refer to the official GitHub repo to download the dataset and perform evaluation. Note that downloading the dataset can be quite cumbersome.

We provide an evaluation script here.

🌟 SugarCrepe

[Figure: SugarCrepe results]

SugarCrepe is a benchmark for faithful vision-language compositionality evaluation. It fixes several biases in the benchmarks above that rendered them hackable, allowing blind models with no access to the image to outperform state-of-the-art vision-language models.

To evaluate on this dataset, simply clone their repo, follow their installation setup, and point --pretrained to our checkpoint:

python main_eval.py --model ViT-B-32 --pretrained Enhance-FineGrained/clip/epoch_5.pt \
    --output ./output \
    --coco_image_root ./data/coco/images/val2017/ \
    --data_root ./data/

Ablations

Our method entails curriculum learning, which is validated by the growth of the adaptive threshold during training.

[Figure: ablation, growth of the adaptive threshold]
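
As a rough illustration only (not the exact loss from the paper), a cross-modal ranking loss whose threshold tracks the gap between positive and hard-negative similarities will grow its threshold as the model improves, which is the curriculum effect shown above:

import torch
import torch.nn.functional as F

# Illustrative sketch only, not the paper's exact CMR formulation.
# pos_sim: [B] image <-> ground-truth caption similarities
# neg_sim: [B, K] image <-> hard-negative caption similarities
def ranking_loss_adaptive(pos_sim, neg_sim, threshold):
    loss = F.relu(threshold + neg_sim - pos_sim[:, None]).mean()
    with torch.no_grad():
        # one possible update rule (assumption): running mean of the pos-neg gap
        threshold = 0.99 * threshold + 0.01 * (pos_sim[:, None] - neg_sim).mean()
    return loss, threshold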

πŸ“Ž Citation

@article{zhang2023contrasting,
  title={Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Fine-grained Understanding},
  author={Zhang, Le and Awal, Rabiul and Agrawal, Aishwarya},
  journal={arXiv preprint arXiv:2306.08832},
  year={2023}
}

πŸ“§ Contact

Please let us know if you have further questions or comments; reach out to [email protected]

enhance-finegrained's People

Contributors

lezhang7, rabiulcste


enhance-finegrained's Issues

Interpretation of the intra-modal contrastive (IMC) loss

Dear authors,
Thank you for sharing the codes and checkpoint from your wonderful CVPR work!
The motivation for treating hard negative texts, which finally derives IMC and CMR loss terms, was really cool to me!

However, while reading the paper, the formulation of the IMC loss was confusing for me.
To my understanding, its motivation is to make a further distinction between the (original) caption and hard negative captions in the text embedding space, which is clear to me, but I cannot connect this to the exact design of equation (3) in the paper.
Is there any reference or intuition behind it?

I also checked its implementation, which does not seem to align with the equation.
It rather tries to maximize the similarity between the (original) caption and the first hard negative caption by the contrastive loss, if I am not mistaken.

### src/training/data.py
class HardNegative_Collate:

    def __call__(self, batch):
        img = torch.stack([example[0] for example in batch])
        ture_caption = torch.stack([example[1][0] for example in batch])
        hard_negative = torch.cat([example[1][1:] for example in batch])
        text = torch.cat([ture_caption, hard_negative])
        valid_caption_mask = torch.stack([example[2] for example in batch])
        return img, text, valid_caption_mask

### src/open_clip/loss.py
# gt_text_features = text_features[:image_features.shape[0]]
# da_text_features = text_features[image_features.shape[0]:]  <- contains only hard negatives?
# embedding_matrix = logit_scale * gt_text_features @ da_text_features.T  #(batch_size,4*batch_size)
def get_imc_loss(self, gt_logits_per_image: torch.Tensor, embedding_matrix: torch.Tensor):
    """
    gt_logits_per_image: standard clip similarity matrix, diag is true gt similarity value : shape [batch_size,5xbatch_size]
    embedding_matrix: extra similarity matrix served as denominator in clip loss
    """

    logtis_matrix = embedding_matrix
    labels = torch.zeros(logtis_matrix.shape[0], device=logtis_matrix.device, dtype=torch.long)
    imc_loss = F.cross_entropy(logtis_matrix, labels)
    return imc_loss

In my humble opinion, prepending an additional column corresponding to the positive caption representation to the embedding_matrix is necessary for the contrastive loss to work desirably.
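
For concreteness, here is a minimal sketch of what I mean; I am assuming the diagonal of gt_logits_per_image as the positive logit, which may of course not be what equation (3) intends:

import torch
import torch.nn.functional as F

# Sketch of the suggestion above, not the repo's code: prepend a positive
# column so that label 0 actually indexes a positive pair.
def get_imc_loss_with_positive(gt_logits_per_image, embedding_matrix):
    pos_logit = torch.diagonal(gt_logits_per_image)[:, None]   # [batch_size, 1]
    logits = torch.cat([pos_logit, embedding_matrix], dim=1)   # [batch_size, 1 + num_negatives]
    labels = torch.zeros(logits.shape[0], device=logits.device, dtype=torch.long)
    return F.cross_entropy(logits, labels)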

Could you share your insights or point out anything that I'm missing?

Best regards,

About the performance on ELEVATER

Hello, sorry to bother you again. I notice that CE-CLIP's performance on ELEVATER reported in the paper is 53.2; however, in my case it is 44.4 using your provided checkpoint. Since this is a huge gap, I don't know whether my code is wrong or there is some other reason. Would you mind sharing your test code for ELEVATER? Thanks very much!

Cannot reproduce the results on VL-Checklist

Hello, thanks for your wonderful work! I cannot reproduce the results on VL-CheckList using your provided checkpoint.
The fine-grained results I get are as follows:

VL_checklist Relation: action (0.7833), spatial (0.7238), average (0.7535) 
VL_checklist Attribute: action (0.7733), color (0.7729), material (0.8091), size (0.6823), state (0.6828), average (0.7441)
VL_checklist Object: location (0.8121), size (0.7984), average (0.8053)

Attention visualization

We need some attention visualization to show how attention gets better with our proposed method.

  • Write a function for visualizing attention on a given image (a rough starting point is sketched below)
  • Figure out which technique we should use for attention maps
  • Sample a few random images and plot attention to them.
  • @Magiccircuit will share a paper that he already considered for this task

Some motivation from here https://arxiv.org/pdf/2012.09838.pdf
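
As a rough starting point (my assumption, using plain attention rollout rather than the relevancy-propagation method of the paper above):

import torch

# Attention rollout (Abnar & Zuidema, 2020) as a simple visualization baseline.
# attentions: list of per-layer attention tensors, each of shape [heads, tokens, tokens].
def attention_rollout(attentions):
    tokens = attentions[0].shape[-1]
    rollout = torch.eye(tokens)
    for attn in attentions:
        attn = attn.mean(dim=0)                       # average over heads
        attn = attn + torch.eye(tokens)               # account for residual connections
        attn = attn / attn.sum(dim=-1, keepdim=True)  # re-normalize rows
        rollout = attn @ rollout
    return rollout[0, 1:]   # CLS-token attention over image patches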

training epoch

I'm curious about the number of epochs your training process involved. According to the paper, it states that the model was trained for 5 epochs; however, in your training script, it appears that the training was actually performed for 10 epochs. This discrepancy has left me somewhat puzzled. I would greatly appreciate it if you could take a moment to explain this. Thank you!

Onboarding for @rabiulcste: Running Evaluations on the VL Checklist, Plus Knowledge Sharing

This issue is intended for the onboarding process of @rabiulcste. The goal of this onboarding is to get Rabiul up to speed with the project and enable him to contribute effectively. As part of this onboarding process, Rabiul will need to run evaluations on the VL checklist. The following steps outline what needs to be done:

  1. The first step for Rabiul is to read the paper, which is a CVPR workshop submission, to gain an understanding of the project. This will help him familiarize himself with the overall objectives and methodology of the project.
  2. As recommended by @Magiccircuit, Rabiul's main task, to begin with, is to run the necessary evaluations on the VL checklist. This is a crucial step in ensuring the accuracy and efficacy of the project's algorithms. The VL checklist can be found at this link: https://arxiv.org/abs/2207.00221
  3. To facilitate Rabiul's evaluation process, Le will share all the checkpoints required to run the codebase. These checkpoints can be found in the /networks/project/aishwaryalab directory.
