yalesong / pvse

Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval (CVPR 2019)

License: MIT License

Python 98.30% Shell 1.70%
cross-modal-retrieval metric-learning mscoco-dataset mrw-dataset tgif-dataset

pvse's Introduction

Polysemous Visual-Semantic Embedding (PVSE)

This repository contains a PyTorch implementation of the PVSE network and the MRW dataset proposed in our paper Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval (CVPR 2019). The code and data are free to use for academic purposes only.

Please also visit our project page.

Table of contents

  1. MRW Dataset
  2. Setting up an environment
  3. Download and prepare data
  4. Evaluate pretrained models
  5. Train your own model

MRW Dataset

Our My Reaction When (MRW) dataset contains 50,107 video-sentence pairs crawled from social media, where videos display physical or emotional reactions to the situations described in the sentences. The subreddit /r/reactiongifs contains many such examples; some representative pairs are shown below:

(a) Physical Reaction: "MRW a witty comment I wanted to make was already said"
(b) Emotional Reaction: "MFW I see a cute girl on Facebook change her status to single"
(c) Animal Reaction: "MFW I cant remember if I've locked my front door"
(d) Lexical Reaction: "MRW a family member askes me why his computer isn't working"

Below are descriptive statistics of the dataset. The word vocabulary size is 34,835. The dataset can be used for evaluating cross-modal retrieval systems under ambiguous/weak association between vision and language.

                      Train     Validation   Test      Total
#pairs                44,107    1,000        5,000     50,107
Avg. #frames          104.91    209.04       209.55    117.43
Avg. #words           11.36     15.02        14.79     11.78
Avg. word frequency   15.48     4.80         8.57      16.94

We provide a detailed analysis of the dataset in the supplementary material of the main paper.

Follow the instructions below to download the dataset.

Setting up an environment

We recommend creating a virtual environment and installing packages there. Note that you must install the Cython package first.

python3 -m venv <your virtual environment name>
source <your virtual environment name>/bin/activate
pip3 install Cython
pip3 install -r requirements.txt

Download and prepare data

MRW

cd data
bash prepare_mrw_dataset.sh

This will download the dataset (without videos) in a JSON format, a vocabulary file, and train/val/test splits. It will then prompt you with an option:

Do you wish to download video data and gulp them? [y/n]

We provide two ways to obtain the data. The recommended option is to download pre-compiled data in the GulpIO binary storage format, which contains video frames sampled at 8 FPS. For this, simply hit n (this will terminate the script) and download our pre-compiled GulpIO data from this link (54 GB). After the download finishes, extract the tarball under data/mrw/gulp to train and/or test our models.

If you wish to download the raw video clips and gulp them on your own, hit y when prompted with the message above. This will start downloading videos and, once finished, start gulping the video files at 8 FPS (you can change this in download_gulp_mrw.py). If you encounter problems downloading the video files, you may also download them directly from this link (19 GB), and then continue gulping them using the script download_gulp_mrw.py.
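
Once the tarball is extracted, you can sanity-check the gulped data by reading a clip back with GulpIO, using the same indexing that data.py uses. Below is a minimal sketch; the directory path and clip id are placeholders, so substitute an id from the downloaded split files:

from gulpio import GulpDirectory

# Placeholder path and id, for illustration only; point this at your actual
# gulp directory and use an id taken from the downloaded split files.
gulp = GulpDirectory('data/mrw/gulp/test')
clip_id = 'SOME_CLIP_ID'
frames, meta = gulp[clip_id, 0:8:1]  # first 8 frames at stride 1
print(len(frames), meta)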

TGIF

cd data
bash prepare_tgif_dataset.sh

This will download the dataset (without videos) in a TSV format, a vocabulary file, and train/val/test splits. Please note that we use a slightly modified version of the TGIF dataset because of invalid video files; the script will automatically download the modified version.

It will then prompt an option:

Do you wish to gulp the data? [y/n]

Similar to the MRW data, we provide two options to obtain the data: (1) download pre-compiled GulpIO data, or (2) download raw video clips and gulp them on your own; we recommend the first option for an easy start. For this, simply hit n and download our pre-compiled GulpIO data from this link (89 GB). Once tgif-gulp.tar.gz finishes downloading, extract the tarball under data/tgif/gulp.

If you wish to gulp your own dataset, hit y and follow the prompt. Note that you must first download a tarball containing the videos before gulping. You can download the file tgif.tar.gz (124 GB) from this link and place it under ./data/tgif. Once you have the video data, the script will start gulping the video files.

MS-COCO

cd data
bash prepare_coco_dataset.sh

Evaluate pretrained models

Download all six pretrained models in a tarball at this link. You can also download each individual file using the links below.

COCO, PVSE (k=1) [download]
python3 eval.py --data_name coco --num_embeds 1 --img_attention --txt_attention --legacy --ckpt ./ckpt/coco_pvse_k1.pth

COCO, PVSE [download]
python3 eval.py --data_name coco --num_embeds 2 --img_attention --txt_attention --legacy --ckpt ./ckpt/coco_pvse.pth

MRW, PVSE (k=1) [download]
python3 eval.py --data_name mrw --num_embeds 1 --img_attention --txt_attention --max_video_length 4 --legacy --ckpt ./ckpt/mrw_pvse_k1.pth

MRW, PVSE [download]
python3 eval.py --data_name mrw --num_embeds 5 --img_attention --txt_attention --max_video_length 4 --legacy --ckpt ./ckpt/mrw_pvse.pth

TGIF, PVSE (k=1) [download]
python3 eval.py --data_name tgif --num_embeds 1 --img_attention --txt_attention --max_video_length 8 --legacy --ckpt ./ckpt/tgif_pvse_k1.pth

TGIF, PVSE [download]
python3 eval.py --data_name tgif --num_embeds 3 --img_attention --txt_attention --max_video_length 8 --legacy --ckpt ./ckpt/tgif_pvse.pth

Using the pretrained models, you should be able to reproduce the results in the table below.

                         Image/Video-to-Text                  Text-to-Image/Video
Dataset   Model          R@1 / R@5 / R@10 / Med r (nMR)       R@1 / R@5 / R@10 / Med r (nMR)
COCO 1K   PVSE (K=1)     66.72 / 91.00 / 96.22 / 1 (0.00)     53.49 / 85.14 / 92.70 / 1 (0.00)
COCO 1K   PVSE           69.24 / 91.62 / 96.64 / 1 (0.00)     55.21 / 86.50 / 93.73 / 1 (0.00)
COCO 5K   PVSE (K=1)     41.72 / 72.96 / 82.90 / 2 (0.00)     30.64 / 61.37 / 73.62 / 3 (0.00)
COCO 5K   PVSE           45.18 / 74.28 / 84.46 / 2 (0.00)     32.42 / 62.97 / 74.96 / 3 (0.00)
MRW       PVSE (K=1)     0.16 / 0.68 / 0.90 / 1700 (0.34)     0.16 / 0.56 / 0.88 / 1650 (0.33)
MRW       PVSE           0.18 / 0.62 / 1.18 / 1624 (0.32)     0.20 / 0.70 / 1.16 / 1552 (0.31)
TGIF      PVSE (K=1)     2.82 / 9.07 / 14.02 / 128 (0.01)     2.63 / 9.37 / 14.58 / 115 (0.01)
TGIF      PVSE           3.28 / 9.87 / 15.56 / 115 (0.01)     3.01 / 9.70 / 14.85 / 109 (0.01)

Train your own model

You can train your own model using train.py; check option.py for all available options.

For example, you can train our PVSE model (k=2) on COCO using the command below. It uses ResNet-152 as the backbone CNN, GloVe word embeddings, an MMD loss weight of 0.01, a DIV loss weight of 0.1, and a batch size of 256:

python3 train.py --data_name coco --cnn_type resnet152 --wemb_type glove --margin 0.1 --max_violation --num_embeds 2 --img_attention --txt_attention --mmd_weight 0.01 --div_weight 0.1 --batch_size 256

For video models, you should set the parameter --max_video_length; otherwise it defaults to 1 (single frame). Here's an example command:

python3 train.py --data_name mrw --max_video_length 4 --cnn_type resnet18 --wemb_type glove --margin 0.1 --num_embeds 4 --img_attention --txt_attention --mmd_weight 0.01 --div_weight 0.1 --batch_size 128

If you use any of the material in this repository we ask you to cite:

@inproceedings{song-pvse-cvpr19,
  author    = {Yale Song and Mohammad Soleymani},
  title     = {Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval},
  booktitle = {CVPR},
  year      = 2019
}

Our code is based on the implementation by Faghri et al.

Notes

Last edit: Tuesday July 16, 2019

pvse's People

Contributors

clt29, dependabot[bot], yalesong


pvse's Issues

problems in mrw and tgif data set

The following problems occurred when I used the MRW and TGIF datasets. How can I solve them?
mrw:

I:\Python3.6\python.exe "M:/HJH/Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval/pvse-master/eval.py"
Loading dataset
Computing results... (eval_on_gpu=False)
Traceback (most recent call last):
  File "M:/HJH/Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval/pvse-master/eval.py", line 290, in <module>
    metrics = evalrank(model, args, split='test')
  File "M:/HJH/Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval/pvse-master/eval.py", line 214, in evalrank
    img_embs, txt_embs = encode_data(model, data_loader, args.eval_on_gpu)
  File "M:/HJH/Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval/pvse-master/eval.py", line 28, in encode_data
    for i, data in enumerate(data_loader):
  File "I:\Python3.6\lib\site-packages\torch\utils\data\dataloader.py", line 819, in __next__
    return self._process_data(data)
  File "I:\Python3.6\lib\site-packages\torch\utils\data\dataloader.py", line 846, in _process_data
    data.reraise()
  File "I:\Python3.6\lib\site-packages\torch\_utils.py", line 369, in reraise
    raise self.exc_type(msg)
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "I:\Python3.6\lib\site-packages\torch\utils\data\_utils\worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "I:\Python3.6\lib\site-packages\torch\utils\data\_utils\fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "I:\Python3.6\lib\site-packages\torch\utils\data\_utils\fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "M:\HJH\Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval\pvse-master\data.py", line 211, in __getitem__
    frames, meta = self.gulp[self.ids[index], start_idx:end_idx:skip]
  File "I:\Python3.6\lib\site-packages\gulpio\fileio.py", line 135, in __getitem__
    chunk_id = self.chunk_lookup[id_]
KeyError: '3R3CSYCwXTmBW'

Process finished with exit code 1

tgif:

I:\Python3.6\python.exe "M:/HJH/Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval/pvse-master/eval.py"
Loading dataset
Computing results... (eval_on_gpu=False)
Traceback (most recent call last):
  File "M:/HJH/Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval/pvse-master/eval.py", line 290, in <module>
    metrics = evalrank(model, args, split='test')
  File "M:/HJH/Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval/pvse-master/eval.py", line 214, in evalrank
    img_embs, txt_embs = encode_data(model, data_loader, args.eval_on_gpu)
  File "M:/HJH/Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval/pvse-master/eval.py", line 28, in encode_data
    for i, data in enumerate(data_loader):
  File "I:\Python3.6\lib\site-packages\torch\utils\data\dataloader.py", line 819, in __next__
    return self._process_data(data)
  File "I:\Python3.6\lib\site-packages\torch\utils\data\dataloader.py", line 846, in _process_data
    data.reraise()
  File "I:\Python3.6\lib\site-packages\torch\_utils.py", line 369, in reraise
    raise self.exc_type(msg)
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "I:\Python3.6\lib\site-packages\torch\utils\data\_utils\worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "I:\Python3.6\lib\site-packages\torch\utils\data\_utils\fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "I:\Python3.6\lib\site-packages\torch\utils\data\_utils\fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "M:\HJH\Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval\pvse-master\data.py", line 296, in __getitem__
    frames, meta = self.gulp[self.ids[index], start_idx:end_idx:skip]
  File "I:\Python3.6\lib\site-packages\gulpio\fileio.py", line 135, in __getitem__
    chunk_id = self.chunk_lookup[id_]
KeyError: 'tumblr_n9n04znLpA1qddk8uo1_500'

Process finished with exit code 1

data url is empty or unavailable

When I run python download_gulp_mrw.py, some data URLs are invalid.

jd4lXngPg4Z0c mp4 url is empty
2H8IBNL2Ae9kQ mp4 url is empty
XKyqwYgBLxha8 mp4 url is empty
http://media1.giphy.com/media/c84Hf3KIsQ1uo/giphy is no longer available!
http://media0.giphy.com/media/Y5lzSqen6Y8nK/giphy is no longer available!

About Ranking Loss

I tried to reproduce your results on the COCO dataset by training from scratch with the suggested settings.
Namespace(batch_size=128, batch_size_eval=32, ckpt='', cnn_type='resnet152', crop_size=224, data_name='coco', data_path='***', debug=False, div_weight=0.1, dropout=0.0, embed_size=1024, eval_on_gpu=False, grad_clip=2.0, img_attention=True, img_finetune=True, legacy=True, log_file='***', log_step=10, logger_name='***', lr=2e-04, margin=0.1, max_video_length=1, max_violation=True, mmd_weight=0.01, num_embeds=2, num_epochs=10, order=False, txt_attention=True, txt_finetune=True, val_metric='rsum', vocab_path='***', weight_decay=0.0, wemb_type='glove', word_dim=300, workers=16)

However, I find that the triplet ranking loss converges to the margin and the final performance is bad.
How can I fix it? Any help would be appreciated. Thank you!
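
For reference, below is a minimal sketch of the hinge-based triplet ranking loss with hardest-negative mining used in VSE++-style models (the implementation this code builds on); it is an illustration, not the repository's exact loss. If the embeddings collapse so that all similarities are nearly equal, every hinge term saturates at the margin, which produces the kind of plateau described above.

import torch

def triplet_ranking_loss(sim, margin=0.1, max_violation=True):
    # sim: (N, N) image-caption similarity matrix; the diagonal holds the
    # positive (matching) pairs.
    pos = sim.diag().view(-1, 1)
    cost_s = (margin + sim - pos).clamp(min=0)       # caption retrieval
    cost_im = (margin + sim - pos.t()).clamp(min=0)  # image retrieval
    # Zero out the diagonal so positives are not treated as negatives.
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_s = cost_s.masked_fill(mask, 0)
    cost_im = cost_im.masked_fill(mask, 0)
    if max_violation:  # keep only the hardest negative per query
        cost_s = cost_s.max(dim=1)[0]
        cost_im = cost_im.max(dim=0)[0]
    return cost_s.sum() + cost_im.sum()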

Not sure how MIL loss is computed

First up, thank you so much for the wonderful code. There is still one thing I am not able to understand from the code: it computes a MIL loss, but I am not sure how this is implemented using the max-pooling idea. This may be partly due to my lack of understanding of MIL itself. If there is any resource you think could help me here, I would be really grateful.
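
For what it's worth, here is a minimal sketch of the multiple-instance/max-pooling idea with K embeddings per sample; it illustrates the concept rather than the repository's exact loss code. Each image-text pair is scored by its best-matching pair of embeddings, so only one of the K instances has to align:

import torch
import torch.nn.functional as F

def mil_similarity(img_embs, txt_embs):
    # img_embs, txt_embs: (N, K, D) -- K candidate embeddings per sample.
    img = F.normalize(img_embs, dim=-1)
    txt = F.normalize(txt_embs, dim=-1)
    # Cosine similarity for every (image k, text l) embedding pair: (N, N, K, K).
    sim = torch.einsum('ikd,jld->ijkl', img, txt)
    # MIL / max-pooling: score each image-text pair by its best (k, l) match.
    return sim.flatten(2).max(dim=-1)[0]  # (N, N)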

Not able to load the pre-trained model, or the model I trained following your instructions

Hello, first of all thanks a lot for sharing such great work. I used your code to train the model on the MSR-VTT dataset, and I did that successfully; thanks for the clear code and instructions. Now I have a "model_best.pth.tar" file. I also downloaded your model mrw_pvse.pth from the GitHub repo. Running eval.py gives an error when loading the model from the given path.

I even tried to load your model "mrw_pvse.pth", but I get the same error:

" RuntimeError: Error(s) in loading state_dict for PVSE:
Missing key(s) in state_dict: "img_enc.cnn.conv1.weight", "img_enc.cnn.bn1.weight", "img_enc.cnn.bn1.bias", "img_enc.cnn.bn1.running_mean", "img_enc.cnn.bn1.running_var", "img_enc.cnn.layer1.0.conv1.weight", "img_enc.cnn.layer1.0.bn1.weight", "img_enc.cnn.layer1.0.bn1.bias", "img_enc.cnn.layer1.0.bn1.running_mean", "img_enc.cnn.layer1.0.bn1.running_var", "img_enc.cnn.layer1.0.conv2.weight", "img_enc.cnn.layer1.0.bn2.weight", "img_enc.cnn.layer1.0.bn2.bias", "img_enc.cnn.layer1.0.bn2.running_mean", "img_enc.cnn.layer1.0.bn2.running_var", "img_enc.cnn.layer1.0.conv3.weight", "img_enc.cnn.layer1.0.bn3.weight", "img_enc.cnn.layer1.0.bn3.bias", "img_enc.cnn.layer1.0.bn3.running_mean", "img_enc.cnn.layer1.0.bn3.running_var", "img_enc.cnn.layer1.0.downsample.0.weight", "img_enc.cnn.layer1.0.downsample.1.weight", "img_enc.cnn.layer1.0.downsample.1.bias", "img_enc.cnn.layer1.0.downsample.1.running_mean", "img_enc.cnn.layer1.0.downsample.1.running_var", "img_enc.cnn.layer1.1.conv1.weight", "img_enc.cnn.layer1.1.bn1.weight", "img_enc.cnn.layer1.1.bn1.bias", "img_enc.cnn.layer1.1.bn1.running_mean", "img_enc.cnn.layer1.1.bn1.running_var", "img_enc.cnn.layer1.1.conv2.weight", "img_enc.cnn.layer1.1.bn2.weight", "img_enc.cnn.layer1.1.bn2.bias", "img_enc.cnn.layer1.1.bn2.running_mean", "img_enc.cnn.layer1.1 .... so on"

Kindly guide me with this issue

About MIL loss

Hi,
In the MIL loss part, you use the cosine similarity d(a, b) = (a · b) / (||a|| * ||b||). This expression becomes large when a is similar to b. Is that the meaning you intend in the MIL loss part?
If I have misunderstood, please correct me; I am confused by this part.
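
As a small illustration (not code from the repository): the quantity in question is a cosine similarity, which grows as the vectors become more similar, so it is a score to be maximized rather than a distance to be minimized:

import torch
import torch.nn.functional as F

a = torch.tensor([1.0, 0.0])
b_similar = torch.tensor([0.9, 0.1])
b_dissimilar = torch.tensor([-1.0, 0.0])

# Cosine similarity (a . b) / (||a|| * ||b||) is larger for similar vectors.
print(F.cosine_similarity(a, b_similar, dim=0))     # close to 1
print(F.cosine_similarity(a, b_dissimilar, dim=0))  # -1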

IndexError in training MRW

Thank you for your awesome work! When training on the MRW dataset, I encounter the following problem:

  File "train.py", line 236, in <module>
    main()
  File "train.py", line 216, in main
    loss = train(epoch, trn_loader, model, criterion, optimizer, args)
  File "train.py", line 68, in train
    for itr, data in enumerate(data_loader):
  File "/root/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 560, in __next__
    batch = self.collate_fn([self.dataset[i] for i in indices])
  File "/root/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 560, in <listcomp>
    batch = self.collate_fn([self.dataset[i] for i in indices])
  File "/root/CV/pvse/data.py", line 230, in __getitem__
    video = self.transform(frames)
  File "/root/CV/pvse/video_transforms.py", line 58, in __call__
    img = t(img)
  File "/root/CV/pvse/video_transforms.py", line 588, in __call__
    i, j, h, w = self.get_params(img_list[0], self.scale, self.ratio)
IndexError: list index out of range

How can I fix it? Thank you for your attention.

Memory leak during the forward pass in eval.py.

Thanks for sharing this great work.
I found that there is a GPU memory leak during the forward pass in eval.py (line 34).
This prevents the evaluation from completing, as it crashes after about 7,700 samples on a 24 GB GPU.
Best regards,
Georges.
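
A common cause of this symptom is running the forward pass with autograd enabled, so intermediate activations are retained across batches. Below is a minimal, generic sketch of guarding evaluation with torch.no_grad(); this is not necessarily the fix applied in this repository, whose encode_data takes images and captions separately:

import torch

def encode_without_grad(model, data_loader):
    # Run the forward pass without building the autograd graph, so
    # activations are freed after each batch instead of accumulating.
    model.eval()
    outputs = []
    with torch.no_grad():
        for batch in data_loader:
            outputs.append(model(batch))  # however the model is called on a batch
    return outputs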

About the re-implementation performance

I tried to reproduce your model on COCO using the command:

python3 train.py --data_name coco --cnn_type resnet152 --wemb_type glove --margin 0.1 --max_violation --num_embeds 2 --img_attention --txt_attention --mmd_weight 0.01 --div_weight 0.1 --batch_size 256

with PyTorch 1.1.0 and torchvision 0.3.0 on a single RTX 2080 Ti.

But the loss has stayed at 0.2001 since epoch 4, and the final performance is bad.
The evaluation result of your provided checkpoint is OK, so I don't think it's a problem with the environment.
Do you have any suggestions about this problem?
Any help would be appreciated. Thank you!

Results are not the same

Hello, I successfully loaded the pre-trained model given for the MRW dataset using the following in the eval.py script:

checkpoint = torch.load(args.ckpt)
model.load_state_dict(checkpoint, strict=False)

but the results are not the same as those given in the article or in the GitHub repo.

The results generated by evaluating pre-trained model "mrw_pvse.pth"

Loading dataset
Computing results... (eval_on_gpu=True)
Images: 5000, Sentences: 5000
rsum: 0.48
Average i2t Recall: 0.07
Image to text: 0.02 0.06 0.14 2509.00 (0.50) 2509.74 (0.50)
Average t2i Recall: 0.09
Text to image: 0.02 0.06 0.18 2500.00 (0.50) 2500.98 (0.50)

Am I making a mistake?

word embedding

Thanks for your work!
I am trying to reproduce your results and I am confused about the word embedding.
I use torchtext 0.4.0 and find 70 COCO words missing from GloVe.
I got very bad results after training for 30 epochs with the same parameters.
Though I use torch 1.2.0 and torchvision 0.4.0, I don't think they are the main reason.
I want to know how you initialize the missing word embeddings, or what other problems you think I am facing.
Any help would be appreciated. Thank you!
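
For reference, one common convention (not necessarily what this repository does) is to initialize words missing from GloVe with a small random vector instead of zeros. A minimal sketch using torchtext, where the GloVe variant ('840B', 300d) is an assumption for illustration:

import torch
from torchtext.vocab import GloVe

glove = GloVe(name='840B', dim=300)  # downloads the vectors on first use

def word_vector(word):
    # Known words get their GloVe vector; unknown words get a small random
    # vector (one common choice, not necessarily what this repo does).
    if word in glove.stoi:
        return glove.vectors[glove.stoi[word]]
    return torch.randn(300) * 0.1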

Bidirectional GRU Bugs

I'm currently porting your code over to run a comparison against it. I came across the following issue in model.py:

pvse/model.py

Lines 269 to 278 in 19071a8

packed = pack_padded_sequence(wemb_out, lengths, batch_first=True)
if torch.cuda.device_count() > 1:
    self.rnn.flatten_parameters()
rnn_out, _ = self.rnn(packed)
padded = pad_packed_sequence(rnn_out, batch_first=True)
# Reshape *final* output to (batch_size, hidden_size)
I = lengths.expand(self.embed_size, 1, -1).permute(2, 1, 0) - 1
out = torch.gather(padded[0], 1, I).squeeze(1)
out = self.dropout(out)

Let's modify the code above to preserve the final hidden state of the RNN.

>>> rnn_out, h_n = self.rnn(packed) 

Now, let's inspect. (Note: the batch size and hidden state size in my implementation may differ from yours.)

>>> h_n.shape
torch.Size([2, 32, 256])
>>> padded[0].shape
torch.Size([32, 15, 512])

So, we see h_n, which is the hidden state after consuming the final input, has 2 in its first dimension, since we have a bidirectional GRU. We also observe (in my case) that the hidden state size is 256, while the output size is 512 (this is because PyTorch concatenates the two BiGRU (forward/backward) states in the output).

Let's examine the 5th element in the batch. First, we find its length:

>>> padded[1][5]
tensor(11)

So, we know the final hidden state occurs at index 10 (and the rest are padded). We can confirm here:

>>> torch.equal(padded[0][5, 10, :256], h_n[0, 5, :])
True

Here is where the problem comes:

>>> torch.equal(padded[0][5, 10, 256:], h_n[1, 5, :])
False

In other words, what we find is that the final hidden state for the reverse direction h_n[1, 5, :], does not equal the hidden state at padded[0][5, 10, 256:].

Similarly, we can verify this in your out variable:

>>> out.shape
torch.Size([32, 512])
>>> torch.equal(out[5,:256], h_n[0, 5, :])
True
>>> torch.equal(out[5,256:], h_n[1, 5, :])
False

In other words, the first part of out matches h_n, but the second half does not.

This is because the final hidden state for the reverse direction of the GRU is not at padded[0][5, 10, 256:]. Because it is reversed, it is actually located at index 0 in padded, not at index 10!

>>> torch.equal(padded[0][5, 10, 256:], h_n[1, 5, :])
False
>>> torch.equal(padded[0][5, 0, 256:], h_n[1, 5, :])
True

In other words, I believe your implementation doesn't use the correct hidden state in the reverse direction. If your reverse GRU is processing sequence S_n, ... S_1 and producing hidden states h_1, ... h_n, your implementation only processes the last item of the sequence S_n, and ignores the rest of the sequence (in the reverse direction). The forward direction's output hidden state captures the entire input, but the reverse state hidden state only captures the last item of the input, and none of the prior input states. So, the two hidden states used aren't even capturing the same sequence.

Long story short, I think you want torch.equal(out[5,256:], h_n[1, 5, :]) to be equal, but it isn't currently.

I believe the issue is also present in VideoEncoder, here:

pvse/model.py

Lines 197 to 199 in 19071a8

states, _ = self.rnn(features)
states = self.dropout(states)
out = states[:, -1, :]

Again, it seems you are using the last index, for both directions of the GRU. This is correct for the forward case, but not backward.
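
For reference, below is a minimal sketch (not code from this repository) of one way to obtain the correct final state for both directions: read it from the h_n that PyTorch returns instead of indexing the padded output.

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

# Hypothetical sizes, for illustration only.
batch, max_len, word_dim, hidden = 32, 15, 300, 256
rnn = nn.GRU(word_dim, hidden, batch_first=True, bidirectional=True)

wemb_out = torch.randn(batch, max_len, word_dim)
lengths = torch.randint(1, max_len + 1, (batch,))

packed = pack_padded_sequence(wemb_out, lengths, batch_first=True,
                              enforce_sorted=False)
_, h_n = rnn(packed)  # h_n: (2, batch, hidden), one state per direction

# h_n[0] is the forward state after the last real token; h_n[1] is the
# backward state after reading the sequence in reverse, end to start.
out = torch.cat([h_n[0], h_n[1]], dim=1)  # (batch, 2 * hidden)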

Strange appearance of the loss curves

I am able to replicate the results but found something strange. I made a small tweak and also plotted the loss curves for the training and validation sets. In doing so, I get a "strange" tuning-fork-shaped loss, although the recall sum looks perfectly fine. Can you please tell me whether this is the expected behavior?

[image: training and validation loss curves]

Your help shall be highly appreciated

Some problems about your paper

I have read your article and I am very interested in your paper, but there were some problems when I repeated your experiment: when I use the COCO dataset to evaluate the pretrained model, I find that the results are not consistent with the results reported in the paper. The only difference is that I use torch==1.2.0 and torchvision==0.4.0. Here are my results:

H:***\Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval\pvse-master>python eval.py --data_name coco --num_embeds 2 --img_attention --txt_attention --ckpt ./ckpt/coco_pvse.pth

Loading dataset
loading annotations into memory...
Done (t=0.41s)
creating index...
index created!
Computing results... (eval_on_gpu=CPU)
Images: 5000, Sentences: 25000
Image to text: 0.00, 0.30, 0.60, 773.00 (0.03), 1016.47 (0.04)
Text to image: 0.14, 0.52, 1.00, 500.00 (0.02), 500.83 (0.02)
rsum: 2.56 ar: 0.30 ari: 0.55
Image to text: 0.00, 0.60, 1.40, 781.00 (0.03), 1017.42 (0.04)
Text to image: 0.10, 0.48, 0.96, 501.00 (0.02), 500.82 (0.02)
rsum: 3.54 ar: 0.67 ari: 0.51
Image to text: 0.00, 0.20, 0.90, 776.00 (0.03), 987.25 (0.04)
Text to image: 0.14, 0.46, 0.98, 500.00 (0.02), 500.38 (0.02)
rsum: 2.68 ar: 0.37 ari: 0.53
Image to text: 0.10, 0.30, 0.50, 778.00 (0.03), 989.40 (0.04)
Text to image: 0.06, 0.50, 1.04, 501.00 (0.02), 500.60 (0.02)
rsum: 2.50 ar: 0.30 ari: 0.53
Image to text: 0.10, 0.50, 0.70, 778.00 (0.03), 985.18 (0.04)
Text to image: 0.10, 0.50, 1.00, 500.00 (0.02), 500.52 (0.02)
rsum: 2.90 ar: 0.43 ari: 0.53

Mean metrics from 5-fold evaluation:
rsum: 17.02
Average i2t Recall: 0.41
Image to text: 0.04 0.38 0.82 777.20 (0.03) 999.14 (0.04)
Average t2i Recall: 0.53
Text to image: 0.11 0.49 1.00 500.40 (0.02) 500.63 (0.02)
rsum: 0.58
Average i2t Recall: 0.08
Image to text: 0.04 0.06 0.14 3891.00 (0.16) 4999.44 (0.20)
Average t2i Recall: 0.11
Text to image: 0.02 0.10 0.22 2502.00 (0.10) 2501.31 (0.10)

Thank you for reading, and I am looking forward to your reply.

KeyError: '3R3CSYCwXTmBW'

When I run python eval.py --data_name mrw --num_embeds 1 --img_attention --txt_attention --max_video_length 4 --ckpt ./ckpt/mrw_pvse_k1.pth, something goes wrong:

(pvse) zhyue@server-55:~/pvse$ python eval.py --data_name mrw --num_embeds 1 --img_attention --txt_attention --max_video_length 4 --ckpt ./ckpt/mrw_pvse_k1.pth
Loading dataset
Computing results... (eval_on_gpu=False)
Traceback (most recent call last):
  File "eval.py", line 290, in <module>
    metrics = evalrank(model, args, split='test')
  File "eval.py", line 214, in evalrank
    img_embs, txt_embs = encode_data(model, data_loader, args.eval_on_gpu)
  File "eval.py", line 28, in encode_data
    for i, data in enumerate(data_loader):
  File "/home/zhyue/anaconda2/envs/pvse/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 582, in __next__
    return self._process_next_batch(batch)
  File "/home/zhyue/anaconda2/envs/pvse/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 606, in _process_next_batch
    raise Exception("KeyError:" + batch.exc_msg)
Exception: KeyError:Traceback (most recent call last):
  File "/home/zhyue/anaconda2/envs/pvse/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/zhyue/anaconda2/envs/pvse/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 99, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/zhyue/pvse/data.py", line 211, in __getitem__
    frames, meta = self.gulp[self.ids[index], start_idx:end_idx:skip]
  File "/home/zhyue/anaconda2/envs/pvse/lib/python3.7/site-packages/gulpio/fileio.py", line 135, in __getitem__
    chunk_id = self.chunk_lookup[id_]
KeyError: '3R3CSYCwXTmBW'

Any suggestion?

RuntimeError: CUDA out of memory.

Thanks for your job!
Sorry to bother you again

When I run your eval.py program, no matter how small I set batch_size_eval, I find that GPU memory keeps accumulating, which eventually causes the graphics card to run out of memory and the program to fail. My graphics card is a GTX 1080 Ti (11 GB).

Any help would be appreciated. Thank you!
