RECVIS18 Final project on Image Captioning with Region Attention based on the Neural Baby Talk (NBT) paper. An implementation of NBT can be found on this repository.
Besides reproducing the paper experiment on the Flickr30k dataset, our goal for this project is to apply the NBT approach to a dataset of unusual visual relations named UnRel. This requires annotating a set of 115 images from the UnRel dataset.
This is a joint effort undertaken by all the groups that have selected this subject.
The goal is to caption a subset of the UnRel dataset and to produce 3 captions for each image. The captions should be:
- General description focusing on relationships between objects (spatial relationships, actions)
- Caption focusing on the attributes (color, pose, behavior) of the subject.
- Most salient spatial relationship.
This section focuses on how to adapt the existing NBT code to be able to produce captions on the subset of annotated UnRel images. Note that to properly function, NBT needs a vision model to output region proposals alongside the detected category, a vocabulary (textual words) and a set of visual words, as well as the ground truth captions to evaluate the captioning task.
In this experiment, I used the ground truth proposals available in the UnRel dataset. I then used Flickr30k vocabulary and created a manual mapping for UnRel categories that were missing in the Flickr30k dataset.
I was able to reverse-engineer most of the steps required to produce the intermediate files used by the model to generate sentences. It includes:
- Extracting the ground truth proposals from UnRel using the
annotations.mat
file. - Manually map missing categories in Flickr30k that are present in UnRel ground truth detections.
- Reformat the proposals so that they exhibit, for each proposal:
(x_min, y_min, x_max, y_max, detection_index, confidence)
- Produce a
dataset_unrel.json
file similar todataset_flickr30k.json
file to gather the captions and images data. - Produce a
cap_unrel.json
similar tocap_coco.json
file to make the ground truth captions available. - Produce a
dic_unrel.json
based ondic_flickr30k.json
to provide the vocabulary to the language model. - Refactor the original code from
demo.py
todemo_unrel.py
to be able to run the demonstration on the UnRel dataset. - Refactor the original code from
dataloader_flickr.py
todataloader_unrel.py
to be able to load the UnRel dataset so that it can be fed to the NBT model as expected. - Refactor the original code from
main.py
toeval_unrel.py
so that the language evaluation can be performed on the UnRel dataset. - Refactor the original configuration file
cfgs/normal_coco_res101.yml
tocfgs/normal_unrel_res101.yml
. I also provide the missingcfgs/normal_flickr30k_res101.yml
in this repository. - Produce a
caption_unrel.json
file similar tocaption_flickr30k.json
intools/coco-caption/annotations/
for the language evaluation to be performed on the UnRel dataset.
- We provide the results in the
visu
folder. - Results table can be found in the
report
folder. - We provide the pre-processed files in the
data
folder. - We provide the configuration files in the
cfgs
folder. - We provide the notebooks used to pre-process the files.