Image-Captioning

PBI project inspired by Andrej Karpathy's CS231n

Datasets used

Flickr8k, Flickr30k, MSCOCO

Since Flickr30k and MSCOCO require large computing power, I might upload the results and notebook for them later. The technique is the same as for Flickr8k. Although I have some decent results for Flickr30k and MSCOCO, I am not uploading them at the moment.

Tools and libraries needed - TensorFlow, Keras, Python 3.x, NumPy, Jupyter Notebook

First, clone or download the repository to your machine.

You will need to download the InceptionV3 weights and our model weights beforehand and store them in their respective folders. The details are given in the Description file of each directory. You can also download them from here

Summary and Conclusions

Our project involved collecting and pre-processing the data, training the model, and running inference. We used datasets of different sizes and tried different models; for example, we first used an LSTM in place of the GRU. We experimented with various hyper-parameters such as learning rate and batch size. Once we started getting decent results, we kept that model and those hyper-parameters and trained for more epochs. Finally, we were able to build a webapp that can generate a caption for any given image. We tried deploying the webapp on the Internet but, due to limited computational power and money, we were unable to do so.

Initially, when the model had been trained for fewer epochs and used an LSTM, it often predicted colours wrongly. Sometimes the caption mentioned only one man when there were many in the picture. Also, with plain argmax prediction, the model used to get stuck in a loop. Using beam search for prediction resolved this problem. After switching to a GRU and training for more epochs, the model started predicting more accurate captions. Since the dataset has many images of dogs, our model works very well on images of dogs, but by the end it also generated good captions for images with humans in them.
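The difference between argmax and beam prediction can be illustrated with a small, self-contained sketch. It is not the decoder used in this project: `TOY_MODEL`, `next_word_probs`, `greedy_decode` and `beam_decode` are hypothetical names, and the hand-written probability table stands in for the trained caption model. The table is chosen so that always taking the most likely next word cycles forever, while keeping two candidate captions lets the decoder finish.

```python
import math

START, END = "<start>", "<end>"

# Hypothetical toy next-word distribution, keyed by the previous word only.
# A real caption model would also condition on the image features.
TOY_MODEL = {
    START: {"dog": 1.0},
    "dog": {"and": 0.55, "runs": 0.45},
    "and": {"dog": 1.0},
    "runs": {END: 1.0},
}


def next_word_probs(tokens):
    """Return the next-word probabilities given the words so far."""
    return TOY_MODEL[tokens[-1]]


def greedy_decode(max_len=6):
    """Argmax prediction: always take the single most likely next word."""
    tokens = [START]
    while len(tokens) - 1 < max_len and tokens[-1] != END:
        probs = next_word_probs(tokens)
        tokens.append(max(probs, key=probs.get))
    return [t for t in tokens if t not in (START, END)]


def beam_decode(beam_width=2, max_len=6):
    """Beam search: keep the beam_width best partial captions at every step."""
    beams = [(0.0, [START])]  # (log-probability, words so far)
    for _ in range(max_len):
        candidates = []
        for logp, tokens in beams:
            if tokens[-1] == END:
                # Finished captions are carried over unchanged.
                candidates.append((logp, tokens))
                continue
            for word, p in next_word_probs(tokens).items():
                candidates.append((logp + math.log(p), tokens + [word]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(tokens[-1] == END for _, tokens in beams):
            break
    best = beams[0][1]
    return [t for t in best if t not in (START, END)]


print(greedy_decode())  # ['dog', 'and', 'dog', 'and', 'dog', 'and']
print(beam_decode())    # ['dog', 'runs']
```

In this toy setting the greedy decoder never reaches the end token because "and" is always slightly more likely than "runs". The beam keeps the less likely "runs" branch alive, and that branch ends up with the highest total probability.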

Our model gives state-of-the-art captions for new images outside the training data. Our final model is trained only on the Flickr8k dataset. With more computational power available, we believe that our model can outperform several related models, as it is able to generate accurate captions without much data pre-processing. We have used a Bidirectional GRU, which, as far as we know, no one has used for image captioning before.
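For readers unfamiliar with the term, a bidirectional GRU reads the input sequence once from left to right and once from right to left, then joins the two hidden states at every time step. The following NumPy sketch shows only that idea. It is not the project's Keras model: the helper names (`init_gru`, `gru_forward`, `bidirectional_gru`), the random weights and the dimensions are all assumptions made for illustration, and the update rule follows one common GRU convention.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_gru(input_dim, hidden_dim, rng):
    """Random weights for one GRU direction (illustrative, untrained)."""
    def mat(rows, cols):
        return rng.normal(scale=0.1, size=(rows, cols))

    params = {}
    for gate in ("z", "r", "h"):  # update gate, reset gate, candidate state
        params["W" + gate] = mat(input_dim, hidden_dim)
        params["U" + gate] = mat(hidden_dim, hidden_dim)
        params["b" + gate] = np.zeros(hidden_dim)
    return params


def gru_forward(params, xs):
    """Run a GRU over xs of shape (T, input_dim); return states (T, hidden_dim)."""
    h = np.zeros(params["bz"].shape[0])
    states = []
    for x in xs:
        z = sigmoid(x @ params["Wz"] + h @ params["Uz"] + params["bz"])
        r = sigmoid(x @ params["Wr"] + h @ params["Ur"] + params["br"])
        h_tilde = np.tanh(x @ params["Wh"] + (r * h) @ params["Uh"] + params["bh"])
        # One common convention: z controls how much of the new candidate is used.
        h = (1.0 - z) * h + z * h_tilde
        states.append(h)
    return np.stack(states)


def bidirectional_gru(fwd_params, bwd_params, xs):
    """Concatenate forward-in-time and backward-in-time GRU states per step."""
    forward_states = gru_forward(fwd_params, xs)
    # Run the second GRU on the reversed sequence, then re-align it in time.
    backward_states = gru_forward(bwd_params, xs[::-1])[::-1]
    return np.concatenate([forward_states, backward_states], axis=-1)


rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 8))  # 5 time steps of 8-dimensional word embeddings
fwd = init_gru(8, 4, rng)
bwd = init_gru(8, 4, rng)
out = bidirectional_gru(fwd, bwd, xs)
print(out.shape)  # (5, 8)
```

Each output row holds 4 forward units followed by 4 backward units, so the state at a given word depends on the words both before and after it.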

There is a lot of scope for future work in image captioning. It is always difficult to make a model generalize. If our dataset contains many animal pictures, the model might be biased towards dogs and perform poorly when there are humans in the image. A better model therefore needs a very large amount of data. Huge datasets in turn call for faster computation and more memory, which makes it difficult to carry out research in this field. One can come up with efficient models, but testing them requires a lot of time and resources.

--

Nilay Shrivastava

Kushagra Bhatnagar
