
simat's Introduction

This repository contains the database and code used in the paper Embedding Arithmetic for Text-driven Image Transformation (Guillaume Couairon, Holger Schwenk, Matthijs Douze, Matthieu Cord).

The inspiration for this work is the geometric properties of word embeddings, such as Queen ~ Woman + (King - Man). We extend this idea to multimodal embedding spaces (like CLIP), which lets us semantically edit images via "delta vectors".

Transformed images can then be retrieved from a dataset of images.
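
Concretely, the edit is a simple vector operation in the shared embedding space. Below is a minimal, self-contained sketch of the idea with random placeholder embeddings (names and dimensions are illustrative; the full CLIP-based example is given further down):

import torch

def normalize(x):
    return x / x.norm(dim=-1, keepdim=True)

# placeholder embeddings standing in for CLIP outputs (512-d, as for ViT-B/32)
img_emb = normalize(torch.randn(512))        # embedding of the input image
src_emb = normalize(torch.randn(512))        # embedding of the source word, e.g. "cat"
tgt_emb = normalize(torch.randn(512))        # embedding of the target word, e.g. "dog"
db_embs = normalize(torch.randn(1000, 512))  # embeddings of the retrieval database

# edit the image embedding with the "delta vector" between the two word embeddings
lbd = 1
target_emb = img_emb + lbd * (tgt_emb - src_emb)

# retrieve the transformed image as the nearest neighbour in the database
retrieved_idx = (db_embs @ target_emb).argmax().item()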

The SIMAT Dataset

We introduce SIMAT, a dataset for evaluating the task of text-driven image transformation on simple images that can be characterized by a single subject-relation-object annotation. A transformation query is a pair (image, query) where the query asks to change the subject, the relation, or the object in the input image. SIMAT contains ~6k images and an average of 3 transformation queries per image.

The goal is to retrieve an image in the dataset that corresponds to the query specifications. We use OSCAR as an oracle to check whether retrieved images are correct with respect to the expected modifications.
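
For illustration, here is a rough sketch of how a single transformation query is checked against the OSCAR oracle. It assumes the files and column names used by eval.py (simat_db/transfos.csv, simat_db/oscar_similarity_matrix.pt); the example values in the comments are hypothetical:

import torch
import pandas as pd

# each row of transfos.csv describes one transformation query on one image
transfos = pd.read_csv('simat_db/transfos.csv', index_col=0)
row = transfos.iloc[0]
# row.dataset_id : index of the input image in the database
# row.value      : word to be replaced (e.g. "cat")
# row.target     : word to replace it with (e.g. "dog")

# OSCAR similarity between database images and target annotations;
# a retrieval is counted as correct when this similarity exceeds 0.5
oscar_scores = torch.load('simat_db/oscar_similarity_matrix.pt')
retrieved_idx = 0  # placeholder: index of the image returned by the retrieval step
is_correct = (oscar_scores[retrieved_idx, row.target_ids] > 0.5).item()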

Examples

Below are a few examples from the dataset, along with images retrieved by our best-performing algorithm.

Download dataset

The SIMAT database is composed of crops of images from Visual Genome. You first need to install Visual Genome and then run the following command:

python prepare_dataset.py --VG_PATH=/path/to/visual/genome

Perform inference with CLIP ViT-B/32

In this example, we use the CLIP ViT-B/32 model to edit an image. Note that the CLIP embeddings of the dataset images are pre-computed.

import torch
import clip
from torchvision import datasets
from PIL import Image
from IPython.display import display

device = 'cuda:0'

# hack to normalize tensors easily
torch.Tensor.normalize = lambda x: x / x.norm(dim=-1, keepdim=True)

# database of images used for the retrieval step
dataset = datasets.ImageFolder('simat_db/images/')

# pre-computed CLIP embeddings of the database images, keyed by region id
db = torch.load('data/simat_img_clip.pt')
db_stacked = torch.stack(list(db.values())).float()
idx2rid = list(db.keys())

model, prep = clip.load('ViT-B/32', device=device)

# encode the input image
image = Image.open('simat_db/images/images/98316.png')
img_enc = model.encode_image(prep(image).unsqueeze(0).to(device)).float().cpu().detach()

# encode the source and target words
txt = ['cat', 'dog']
txt_enc = model.encode_text(clip.tokenize(txt).to(device)).float().cpu().detach()

# optionally, we can apply the adaptation (linear) layers on top of the embeddings
heads = torch.load('data/head_clip_t=0.1.pt')
img_enc = heads['img_head'](img_enc)
txt_enc = heads['txt_head'](txt_enc)
db_stacked = heads['img_head'](db_stacked).normalize()

# transformation step: move the image embedding along the "cat" -> "dog" direction
lbd = 1
target_enc = img_enc.normalize() + lbd * (txt_enc[1].normalize() - txt_enc[0].normalize())

# retrieval step: nearest neighbour of the transformed embedding in the database
retrieved_idx = (db_stacked @ target_enc.float().T).argmax(0).item()
retrieved_rid = idx2rid[retrieved_idx]
display(Image.open(f'simat_db/images/images/{retrieved_rid}.png'))

Compute SIMAT scores with CLIP

You can run the evaluation script with the following command:

python eval.py --backbone clip --domain dev --tau 0.01 --lbd 1 2

It automatically loads the adaptation layer corresponding to the given value of tau.
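
For reference, a minimal sketch of that loading step, assuming the head file naming used elsewhere in this README (data/head_clip_t={tau}.pt):

import torch

tau = 0.01  # the value passed with --tau
heads = torch.load(f'data/head_clip_t={tau}.pt')
img_head, txt_head = heads['img_head'], heads['txt_head']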

Train adaptation layers on COCO

In this part, you can train linear adaptation layers on top of the CLIP encoders, using the COCO dataset, to get a better text-image alignment. Here is an example:

python adaptation.py --backbone ViT-B/32 --lr 0.001 --tau 0.1 --batch_size 512
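
For intuition, here is a hedged sketch of what one training step of such adaptation layers could look like, assuming a standard CLIP-style symmetric contrastive loss at temperature tau on top of frozen, pre-computed CLIP features; this is an illustration, not the exact adaptation.py implementation:

import torch
import torch.nn.functional as F

def normalize(x):
    return x / x.norm(dim=-1, keepdim=True)

# hypothetical batch of frozen CLIP features for matching image/caption pairs
img_feat = torch.randn(512, 512)  # (batch, dim)
txt_feat = torch.randn(512, 512)  # (batch, dim)

# the adaptation layers are simple linear projections
img_head = torch.nn.Linear(512, 512)
txt_head = torch.nn.Linear(512, 512)
tau = 0.1

# symmetric contrastive loss: matching pairs lie on the diagonal of the logits
img_emb = normalize(img_head(img_feat))
txt_emb = normalize(txt_head(txt_feat))
logits = img_emb @ txt_emb.T / tau
labels = torch.arange(len(logits))
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
loss.backward()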

Citation

If you find this paper or dataset useful for your research, please use the following citation.

@article{gco1embedding,
  title={Embedding Arithmetic for Text-driven Image Transformation},
  author={Guillaume Couairon and Matthieu Cord and Matthijs Douze and Holger Schwenk},
  journal={arXiv preprint arXiv:2112.03162},
  year={2021}
}

References

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision, OpenAI 2021

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, Fei-Fei Li. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations, IJCV 2017

Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, Jianfeng Gao. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, ECCV 2020

License

SIMAT is released under the MIT license. See LICENSE for details.


simat's Issues

RuntimeError when finetuning CLIP adaptation layers

Hi,

When I ran the command python adaptation.py --backbone ViT-B/32 --lr 0.001 --tau 0.1 --batch_size 512 --gpus 2, I encountered the following error:

  ...
  File "/usr/.conda/envs/simat/lib/python3.8/site-packages/pytorch_lightning/overrides/base.py", line 93, in forward
    return self.module.validation_step(*inputs, **kwargs)
  File "adaptation.py", line 99, in validation_step
    img_ = self.core.encode_image(img).detach()
  File "/usr/.conda/envs/simat/lib/python3.8/site-packages/clip/model.py", line 341, in encode_image
    return self.visual(image.type(self.dtype))
  File "/usr/.conda/envs/simat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/.conda/envs/simat/lib/python3.8/site-packages/clip/model.py", line 224, in forward
    x = self.conv1(x)  # shape = [*, width, grid, grid]
  File "/usr/.conda/envs/simat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/.conda/envs/simat/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/usr/.conda/envs/simat/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 453, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [768, 3, 32, 32], expected input[1, 1536, 224, 224] to have 3 channels, but got 1536 channels instead

I think the error occurred because the batch and channel dimensions are not decoupled.
To resolve it, I inserted the line img = img.reshape(-1, 3, img.shape[-1], img.shape[-1]), which separates the batch and channel dimensions, at lines 77 and 90 of adaptation.py. After that, the code ran without the error.
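
For illustration, the reshape in question turns the fused tensor from the error message back into a standard image batch (shapes taken from the traceback above):

import torch

# fused batch/channel tensor as in the error: [1, 1536, 224, 224] = [1, 512*3, 224, 224]
img = torch.randn(1, 1536, 224, 224)
img = img.reshape(-1, 3, img.shape[-1], img.shape[-1])
print(img.shape)  # torch.Size([512, 3, 224, 224]), the shape CLIP's conv1 expects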

Is my modification the right solution? I had another reproduction problem with this code, and I'm worried this might be the cause.

Thanks,
Janet

SIMAT score of zero-shot CLIP not reproduced

Hi!
I'm trying to reproduce the SIMAT score of zero-shot CLIP reported in the paper (Table 1, Fig. 5).
I used eval.py to reproduce this score, and changed it slightly to remove the adaptation layer.

def simat_eval(args):
    #img_head, txt_head, emb_key='clip', lbds=[1], test=True:, tau
    # get heads !
    emb_key = 'clip'
    heads = torch.load(f'data/head_{emb_key}_t={args.tau}.pt')
    output = {}
    transfos = pd.read_csv('simat_db/transfos.csv', index_col=0)
    transfos = transfos[transfos.is_test == (args.domain == 'test')]
    img_embs = torch.load('data/clip_simat_2.pt').float()


    img_embs = img_embs.normalize()
    #img_embs = heads['img_head'](clip_simat).normalize()
    value_embs = torch.stack([img_embs[did] for did in transfos.dataset_id])

    word_embs = dict(torch.load(f'data/simat_words_{emb_key}_2.ptd'))
    #w2v = {k:heads['txt_head'](v.float()).normalize() for k, v in word_embs.items()}
    w2v = {k:(v.float()).normalize() for k, v in word_embs.items()}
    delta_vectors = torch.stack([w2v[x.target] - w2v[x.value] for i, x in transfos.iterrows()])

    oscar_scores = torch.load('simat_db/oscar_similarity_matrix.pt')
    weights = 1/np.array(transfos.norm2)**.5
    weights = weights/sum(weights)

    for lbd in args.lbds:
        target_embs = value_embs + lbd*delta_vectors

        nnb = (target_embs @ img_embs.T).topk(5).indices

        nnb_notself = [r[0] if r[0].item() != t else r[1] for r, t in zip(nnb, transfos.dataset_id)]

        scores = np.array([oscar_scores[ri, tc] for ri, tc in zip(nnb_notself, transfos.target_ids)]) > .5


        output[lbd] = 100*np.average(scores, weights=weights)
    return output

When I use your embedding files (clip_simat.pt, simat_words_clip.ptd) the results are fine, but I cannot reach full performance with the embeddings that I made with encode.py.
(I used the original CLIP model from the OpenAI repository, so the CLIP model itself is probably not the problem.)

I would appreciate it if you could check this. Thank you!

Problem reproducing the adaptation layers

Hi @PhazCode,

Thanks for this great work.

I tried to reproduce the training of new CLIP adaptation layers on the COCO dataset.
After training, I computed SIMAT scores with my trained weights and the results are as follows:

lbd=1.0: 13.09
lbd=2.0: 25.78
lbd=3.0: 29.94
lbd=4.0: 28.73
lbd=5.0: 26.21

When I compute SIMAT scores with the provided weights, the results are as follows:

lbd=1.0: 47.59
lbd=2.0: 35.78
lbd=3.0: 29.04
lbd=4.0: 26.36
lbd=5.0: 24.52

For training, I followed the hyperparameters mentioned in the paper. The other parameters were initialized with the default values in adaptation.py. The hyperparameters are as follows:

  • max_epochs : 30
  • batch_size : 4096
  • lr : 1e-3
  • tau : 0.1
  • sched_step_size : 25
  • sched_gamma : 0.1

The command that I used was: python adaptation.py --backbone ViT-B/32 --lr 1e-3 --tau 0.1 --batch_size 4096 --wandb --max_epochs 30
I also ran the command written in README.md: python adaptation.py --backbone ViT-B/32 --lr 0.001 --tau 0.1 --batch_size 512

Unfortunately, I couldn't reproduce the same or similar results as the provided weights with either command.
Is there any mistake on my side, or any way to resolve this problem?

My settings are as follows:

  • PyTorch: 1.12.1
  • CUDA: 11.3
  • Python: 3.8.13
  • GPUs: 8 Tesla V100

Thanks,
Janet

Unable to reproduce adaptation results

Hi,

Congrats on the amazing work!

I am trying to reproduce the results of your experiment fine-tuning CLIP on MS-COCO (Figure 5, Section 5.2 in the paper); however, I am running into issues with the exact fine-tuning and am getting lower SIMAT scores than reported in your paper.

These are the training loss and validation loss plots from the fine-tuning:

Adaptation at tau=0.01: [training/validation loss plot]

Adaptation at tau=0.1: [training/validation loss plot]

I used the exact same hyperparameter settings as you have done in your adaptation script:

  • learning rate=1e-3
  • lr decay schedule with step_size=25 and gamma=0.1
  • num_epochs=50
  • gradient clipping at norm=1

Do you have any insights into where the training script might be going wrong and why the loss seems to be stagnating as we step through training? After adaptation (with these training plots), I get a SIMAT score of 37.10 compared to your 47.5 (at tau=0.1, lambda=1). Similarly, I get a SIMAT score of around 16.61 compared to your 17.10 (at tau=0.01, lambda=1).

Note: I had to reimplement chunks of your code in PyTorch, as I believe parts of your adaptation script were incomplete in PyTorch Lightning; I would be happy to share my adaptation script with you if that would help!

Hoping for a prompt response.
