This project is based on the paper: Vinyals et al., "Show and Tell: A Neural Image Caption Generator," CVPR 2015. You should use one of the pretrained CNN image recognition models and the COCO captions dataset from torchvision (https://pytorch.org/docs/stable/torchvision/index.html). On top of the recognition model, you should train an image captioning system that generates a caption sentence from the representation vector computed by the CNN model.
Project by Wellesley Boboc, Meryem Karalioglu, & Rodrigo Lopez Portillo Alcocer