Official repository of ICCV 2021 - Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models

Home Page: https://cuberick-orion.github.io/CIRR/

License: MIT License

image-retrieval composed-image-retrieval

cirr's Introduction

Composed Image Retrieval on Real-life Images

This repository contains the Composed Image Retrieval on Real-life images (CIRR) dataset.

For details please see our ICCV 2021 paper - Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models.

You are currently viewing the Dataset repository. Site navigation > Project homepage  |  Code repository


News and Upcoming Updates

  • Jun 2024 Please contact us if you are having trouble gaining access to the raw images from NLVR2.
  • Jun 2024 Download links have been updated.
  • Please note there is a typo in our paper (Table 2) -- the number of pairs in val is 4,181, not 4,184.

Download CIRR Dataset

Our dataset is structured similarly to Fashion-IQ, an existing dataset for this task.

Annotations

Obtain the annotations by:

# create a `data` folder at your desired location
mkdir data
cd data

# clone the cirr_dataset branch to the local data/cirr folder
git clone -b cirr_dataset git@github.com:Cuberick-Orion/CIRR.git cirr

The data/cirr folder contains all relevant annotations. The file structure is described below.

Raw Images

Updated June 2023 Recent methods for composed image retrieval (and related tasks) often use raw images rather than our pre-extracted features. However, we are not at liberty to distribute these images. If you'd like to access them, please see below.

Important

We do not recommend downloading the images via the URLs, as the list contains many broken links and the downloaded files lack the sub-folder structure of the /train folder. Instead, we suggest following the instructions here to directly access the images. To quote the authors:

To obtain access, please fill out the linked Google Form. This form asks for your basic information and asks you to agree to our Terms of Service. We will get back to you within a week. If you have any questions, please email [email protected].

You can also email us if, for any reason, you receive no response from the NLVR2 group.

Pre-extracted Image Features

The available types of image features are:

  • ImageNet pre-trained ResNet152 features
    • can be extracted from raw images
    • or download our pre-extracted features Download
  • F-RCNN image regional features
    • provided by OSCAR as we source our images from NLVR2
    • download the subset of features used in CIRR (unused images filtered out and re-zipped by us) Download
    • alternatively, download directly from OSCAR

Each zip file we provide contains a folder of per-image feature files in .pkl format.

Once downloaded, unzip it into data/cirr/, following the file structure below.
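
Before using them, it is worth sanity-checking that the feature files load correctly. A minimal sketch (the path below is only an example, following the file structure described in the next section):

import pickle

# example path, following the file structure described below
feat_path = 'data/cirr/img_feat_res152/test1/test1-0-0-img0.pkl'

with open(feat_path, 'rb') as f:
    feat = pickle.load(f)

# each ResNet152 feature is a 2048-dimensional numpy array
print(type(feat), feat.shape)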

Dataset File Structure

The downloaded dataset should look like this (click to expand)
data
└─── cirr
    ├─── captions
    │        cap.VER.test1.json
    │        cap.VER.train.json
    │        cap.VER.val.json
    ├─── captions_ext
    │        cap.ext.VER.test1.json
    │        cap.ext.VER.train.json
    │        cap.ext.VER.val.json
    ├─── image_splits
    │        split.VER.test1.json
    │        split.VER.train.json
    │        split.VER.val.json
    ├─── img_raw  
    │    ├── train
    │    │    ├── 0 # sub-level folder structure inherited from NLVR2 (carries no special meaning in CIRR)
    │    │    │    <IMG0_ID>.png
    │    │    │    <IMG1_ID>.png
    │    │    │         ...
    │    │    ├── 1
    │    │    │    <IMG0_ID>.png
    │    │    │    <IMG1_ID>.png
    │    │    │         ...
    │    │    ├── 2
    │    │    │    <IMG0_ID>.png
    │    │    │    <IMG1_ID>.png
    │    │    └──       ...
    │    ├── dev         
    │    │      <IMG0_ID>.png
    │    │      <IMG1_ID>.png
    │    │           ...
    │    └── test1       
    │           <IMG0_ID>.png
    │           <IMG1_ID>.png
    │                ...
    ├─── img_feat_res152 
    │        <Same subfolder structure as above>
    └─── img_feat_frcnn         
             <Same subfolder structure as above>

Dataset File Description

  • captions/cap.VER.SPLIT.json

    • A list of elements, where each element contains core information on a query-target pair.

    • Details on each entry can be found in the supp. mat. Sec. G of our paper.

    • Click to see an example
          {"pairid": 12063, 
          "reference":   "test1-147-1-img1", 
          "target_hard": "test1-83-0-img1", 
          "target_soft": {"test1-83-0-img1": 1.0}, 
          "caption": "remove all but one dog and add a woman hugging   it", 
          "img_set": {"id": 1, 
                      "members": ["test1-147-1-img1", 
                                  "test1-1001-2-img0",  
                                  "test1-83-1-img1",           
                                  "test1-359-0-img1",  
                                  "test1-906-0-img1", 
                                  "test1-83-0-img1"],
                      "reference_rank": 3, 
                      "target_rank": 4}
          }
  • captions_ext/cap.ext.VER.SPLIT.json

    • A list of elements, where each element contains auxiliary annotations on a query-target pair.

    • Details on the auxiliary annotations can be found in the supp. mat. Sec. C of our paper.

    • Click to see an example
          {"pairid": 12063, 
          "reference":   "test1-147-1-img1", 
          "target_hard": "test1-83-0-img1", 
          "caption_extend": {"0": "being a photo of dogs", 
                            "1": "add a big dog", 
                            "2": "more focused on the hugging", 
                            "3": "background should contain grass"}
          }
  • image_splits/split.VER.SPLIT.json

    • A dictionary, where each key:value pair maps an image filename to the relative path of the image file, for example:
      "test1-147-1-img1": "./test1/test1-147-1-img1.png",
      # or
      "train-11041-2-img0": "./train/34/train-11041-2-img0.png"
    • image filenames and (train-split) sub-level folder structures are preserved from the NLVR2 dataset.
  • img_feat_<...>/

    • A folder containing one type of pre-extracted image features; each file stores the feature of one image.
    • Filenames are generated as follows:
      <IMG0_ID> = "test1-147-1-img1.png".replace('.png','.pkl')
      which in this case yields test1-147-1-img1.pkl, so that each feature file can be directly indexed by its image name (see the sketch below).
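
As a quick illustration of how these files relate, here is a minimal sketch that loads one query-target pair from the val split and resolves its reference image to the corresponding raw image path and ResNet152 feature file (the data path is an example, and <VER> must be replaced by the version tag in your downloaded filenames):

import json
import os
import pickle

DATA = 'data/cirr'
VER = '<VER>'   # version tag as it appears in your downloaded annotation filenames
SPLIT = 'val'

# core annotations and the image-name -> relative-path mapping
with open(os.path.join(DATA, 'captions', f'cap.{VER}.{SPLIT}.json')) as f:
    pairs = json.load(f)
with open(os.path.join(DATA, 'image_splits', f'split.{VER}.{SPLIT}.json')) as f:
    name_to_relpath = json.load(f)

pair = pairs[0]
ref_name = pair['reference']      # an image name as used in the split file
tgt_name = pair['target_hard']

# raw image path (only available if you have obtained the NLVR2 images)
ref_img_path = os.path.join(DATA, 'img_raw', name_to_relpath[ref_name])

# pre-extracted ResNet152 feature: same relative path with .png replaced by .pkl
ref_feat_path = os.path.join(DATA, 'img_feat_res152',
                             name_to_relpath[ref_name].replace('.png', '.pkl'))
with open(ref_feat_path, 'rb') as f:
    ref_feat = pickle.load(f)

print(pair['caption'], ref_feat.shape)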

Test-split Evaluation Server

We do not publish the ground truth for the test split of CIRR. Instead, an evaluation server is hosted here, should you wish to publish results on the test split. The functions of the test-split server will be updated incrementally.

See test-split server instructions.

The server is hosted independently at CECS ANU, so please email us if the site is down.

License

  • We have licensed the annotations of CIRR under the MIT License. Please refer to the LICENSE file for details.

  • Following NLVR2 Licensing, we do not license the images used in CIRR, as we do not hold the copyright to them.

  • The images used in CIRR are sourced from the NLVR2 dataset. Users shall be bound by its Terms of Service.

Citation

Please cite our paper if it helps your research:

@InProceedings{Liu_2021_ICCV,
    author    = {Liu, Zheyuan and Rodriguez-Opazo, Cristian and Teney, Damien and Gould, Stephen},
    title     = {Image Retrieval on Real-Life Images With Pre-Trained Vision-and-Language Models},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {2125-2134}
}

Contact

If you have any questions regarding our dataset, model, or publication, please create an issue in the project repository, or email us.


cirr's Issues

I want to know about the loss function more definitely

In the paper, the loss function is given as L = log[1 + exp(k(φ_i, ϕ^-_{i,j}) - k(φ_i, ϕ^+_i))].
For this loss to approach zero, k(φ_i, ϕ^-_{i,j}) must be small and k(φ_i, ϕ^+_i) must be large.

But I think k(φ_i, ϕ^+_i) should be small, because it is the L2 distance between the prediction and the target, and k(φ_i, ϕ^-_{i,j}) should be large, because it is the distance between the prediction and a false image's feature. So I think the loss function should instead be L = log[1 + exp(k(φ_i, ϕ^+_i) - k(φ_i, ϕ^-_{i,j}))].

I want to know whether my understanding is correct.

Subset of the NLVR^2

Hi, I need the raw images for my research and I submitted the request. After that, I received the reply with the NLVR2 dataset, but I don't know how to select the subset contained in CIRR. Can you help me?
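
For reference, a rough sketch of gathering the set of image names that CIRR actually uses, based on the split files in this repository (this is only an assumption about how to filter, not an official procedure):

import glob
import json

# collect every image name referenced by CIRR from the split files
needed = set()
for split_file in glob.glob('data/cirr/image_splits/split.*.json'):
    with open(split_file) as f:
        needed.update(json.load(f).keys())   # keys are image names such as "train-11041-2-img0"

print(len(needed), 'images are referenced by CIRR')
# one could then keep only the NLVR2 files whose filename (without .png) appears in `needed`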

Clarification on Table 3: "Retrieval performance on CIRR"

Hello

I would like some clarification on Table 3, “Retrieval performance on CIRR”:
In particular, I would like to ask whether the query image (a.k.a. the reference image) is removed from the retrieved results when calculating the metrics.

From the results, particularly the Random (theoretical) row, we can infer that it is.
In fact, if the query image were not removed, Recall_subset@1 should be 1/6 (16.6%) instead of 1/5 (20%).

Before proceeding with my experiments I would like to have a confirmation directly from you.

Thanks for the amazing work

Question about image feature extraction method

Hello?

I would like to test on custom images with the CIRPLANT pre-trained model.
However, the released code does not contain a method for extracting image features.

I tested with the PyTorch pre-trained resnet152 model.
However, the extracted image features are different from the released ones.

My Test Code
import numpy as np
import pickle
import PIL
import torch
import torchvision

# from trainval_oscar.py
transform = torchvision.transforms.Compose([
    torchvision.transforms.Resize(224),
    torchvision.transforms.CenterCrop(224),
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

# from model/datasets.py
PIL.Image.MAX_IMAGE_PIXELS = 1000000000
PIL.Image.warnings.simplefilter('error', PIL.Image.DecompressionBombWarning)

with open('./data/cirr/data/test1/test1-0-0-img0.png', 'rb') as f:

    # from model/datasets.py
    im = PIL.Image.open(f)
    im = im.convert('RGB')
    im = transform(im)

im = torch.unsqueeze(im, 0)

model_ft = torchvision.models.resnet152(pretrained=True)
feature_extractor = torch.nn.Sequential(*list(model_ft.children())[:-1])

o = feature_extractor(im)
feat = o.squeeze().detach().numpy()

print('myval:', type(feat), feat.shape, feat)

with open('./data/cirr/img_feat_res152/test1/test1-0-0-img0.pkl', 'rb') as f:
    feat = pickle.load(f)

print('paper:', type(feat), feat.shape, feat)

Output:
myval: <class 'numpy.ndarray'> (2048,) [0.7299092  0.48872587 0.6812962  ... 0.45136097 0.3343735  0.38762665]
paper: <class 'numpy.ndarray'> (2048,) [1.111343   0.26588327 0.13394219 ... 0.20811372 0.08829021 0.12060333]

Can you explain how you extracted the image features for train/val/test images?

Thanks for your great work

Clarification on Retrieval Metrics

@Cuberick-Orion

Thank you for the great work!

I'm just starting to survey the image retrieval literature on pre-trained vision-and-language models and have a clarification question regarding the evaluation metrics.

Recall_subset@K as designed in this paper makes sense to me, because the cardinality of the subset (in this case 5, since we're excluding the reference image itself) is the total number of relevant items for each query. This lines up with the definition of Recall@K within recommender systems (e.g. https://www.pinecone.io/learn/offline-evaluation/).

My question relates to the Recall@1, 10, 50 numbers reported in Table 3. How were these numbers calculated? What is the value of the denominator when calculating recall, since we don't know the true number of false negatives? In Section 4 you mention that this is what is commonly reported in other work, by setting K to a large value and considering all images in D \ {I_R, I_T} as negatives. Does that mean that when calculating Recall@50 you simply count as false negatives everything outside the 6-member subset of the query image?
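
To make my question concrete, here is a rough sketch of how I currently picture the two metrics for a single query with one ground-truth target (entirely my own assumption, not taken from your released code):

def recall_at_k(ranked_gallery, target, k):
    # 1 if the single ground-truth target appears in the top-k of the full
    # gallery ranking, else 0; averaging over queries gives Recall@K
    return float(target in ranked_gallery[:k])

def recall_subset_at_k(ranked_subset, target, k):
    # same idea, but candidates are restricted to the 5 other members of the
    # query's image set (the reference image itself is excluded)
    return float(target in ranked_subset[:k])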

Submission server problem?

I could submit several hours ago, but now I cannot submit anymore; it always pops up an error like #15. Are there any problems with the server?

Access to evaluation server

Hi, thanks for publishing the dataset and code.
I have a question about the evaluation server for the test split.
The server does not seem to be working for me; the link shows "502 Bad Gateway".

Is there a way to evaluate a model on test split?

Thanks

Some questions about paper

Thank you for sharing your work. I'm confused about some of the content in the paper. Are the vision-and-language pre-trained (VLP) multi-layer transformers actually pre-trained? Which pre-training model was used, and on what dataset was the pre-training done?

Number of validation samples in annotations

Thank you very much for contributing this great dataset !

As far as I can see in the recently downloaded annotations, there are 4,181 samples and not 4,184 pairs as mentioned in the paper (Table 2) for the validation split.
Could you let me know if this is the correct number of expected samples for this split?

Have a great day !

Yana

How to download raw images in a faster way?

I found two ways to download the raw images.
The first is downloading through the URLs, which has serious disadvantages.
The other is sending a request via the Google Form, but does it really take a week to get a reply? I hope it can be faster. Thanks a lot!!!
A

problem in loading features

Hello,
CIRR is really brilliant work!

Now I am trying to load the Faster R-CNN features using pickle.load, and I encounter some strange problems.
[screenshot]

It is fine to load the ResNet152 features; searching on Google does not find any clue.
[screenshot]
