vat0599 / automated-image-captioner

Reads and gives a textual description of an image using concepts of NLP and CNN combined with RNN.

License: MIT License

Jupyter Notebook 100.00%
jupyter-notebook python3 nlp computer-vision text-analysis keras

Automated Image Captioner

This program reads an image and produces a textual description of it. It combines concepts from NLP, CNNs, and RNNs to process the information. It was developed in Python 3 using Jupyter Notebook.

Steps to follow:

  • This project uses the Flickr 8k dataset for training and testing the model. Download the dataset from this link and the caption folder from this link.
  • Put these two folders in the same folder as this repo's Jupyter notebook.
  • Once the setup is ready, open the Jupyter notebook and edit the directory paths in the second cell.
  • Run the cells in order; training starts partway through the notebook. Training takes a fairly long time on a CPU.
  • After training, you can test the model with any input image and observe the results.
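
Before running the notebook, it can help to confirm that the folders are where the second cell expects them. The following is a minimal sketch of such a check; the folder names `Flickr8k_Dataset` and `Flickr8k_text` and the helper `missing_paths` are hypothetical and are not taken from the notebook.

```python
from pathlib import Path


def missing_paths(paths):
    """Return the entries of `paths` that do not exist on disk."""
    return [str(p) for p in map(Path, paths) if not p.exists()]


# Hypothetical folder names; replace them with the paths set in the
# notebook's second cell.
required = ["Flickr8k_Dataset", "Flickr8k_text"]
for path in missing_paths(required):
    print(f"Missing: {path}")
```

If nothing is printed, both folders were found relative to the current working directory.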

How does the code work?

The code is fairly straightforward. First, we use ResNet (a convolutional neural network that converts an image into a feature vector) to extract a feature vector for every input image. This is important because once the model has a representation of the contents of the image, it is easier to map them to the corresponding caption text.

After extracting the feature vectors for all the images, we preprocess the caption text. The words in the captions are converted to indices, which are then vectorized as one-hot encodings. After appropriate padding, we have one vector representing the caption text and another representing the image features.
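
The caption preprocessing described above can be sketched as follows. This is an illustration under assumptions, not the notebook's code: the `<start>` and `<end>` tokens, the choice of index 0 for padding, and the helper names `build_vocab`, `encode`, and `one_hot` are all hypothetical.

```python
import numpy as np


def build_vocab(captions):
    """Map each distinct word to an integer index; 0 is reserved for padding."""
    words = sorted({w for c in captions for w in c.lower().split()})
    return {w: i + 1 for i, w in enumerate(words)}


def encode(caption, word_to_idx, max_len):
    """Convert a caption to word indices, right-padded with 0 up to max_len."""
    idx = [word_to_idx[w] for w in caption.lower().split()][:max_len]
    return idx + [0] * (max_len - len(idx))


def one_hot(index, vocab_size):
    """Return a vector of zeros with a single 1 at position `index`."""
    vec = np.zeros(vocab_size, dtype=np.float32)
    vec[index] = 1.0
    return vec


# Toy captions; the <start>/<end> markers are an assumed convention.
captions = ["<start> a dog runs <end>", "<start> a child plays <end>"]
word_to_idx = build_vocab(captions)
vocab_size = len(word_to_idx) + 1  # +1 for the padding index
padded = [encode(c, word_to_idx, max_len=6) for c in captions]
```

Each caption becomes a fixed-length list of indices, and `one_hot` turns any single index into the vector form used as a prediction target.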

Next, we build a model with two branches that run in parallel, one for images and the other for captions. In the model description, the image branch takes the encoded image vector as input and passes it through dense layers followed by a repeat vector. The language branch takes the vectorized captions as input and passes them through an LSTM (a recurrent neural network that captures the meaning and context of sentences) and a time-distributed layer, which applies a layer to every temporal slice of an input.
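
To make the shapes concrete, here is a small NumPy sketch of what a repeat vector and a time-distributed dense layer do. It only imitates the behavior of the corresponding Keras layers; the sizes are illustrative and the random arrays stand in for real layer outputs, so none of the values come from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
max_len, embed_dim, hidden_dim = 6, 4, 5  # illustrative sizes, not the notebook's

# Image branch: one dense-projected feature vector, repeated once per time step
# so it lines up with the caption sequence.
image_vec = rng.normal(size=(embed_dim,))
image_seq = np.tile(image_vec, (max_len, 1))  # shape (max_len, embed_dim)

# Language branch: random stand-in for the LSTM's per-step outputs.
lstm_out = rng.normal(size=(max_len, hidden_dim))

# Time-distributed dense layer: the same weights are applied at every time step.
W = rng.normal(size=(hidden_dim, embed_dim))
b = np.zeros(embed_dim)
caption_seq = lstm_out @ W + b  # shape (max_len, embed_dim)
```

Both branches end with one vector per time step, which is what allows them to be merged step by step afterwards.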

Finally, after the two branches are processed in parallel, they are combined. The outputs of the image model and the language model are concatenated and passed through two LSTM layers followed by dense layers. A softmax activation is then applied, giving a flattened output vector. RMSprop is used as the optimizer.
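
The merge and the final softmax can be illustrated in NumPy as below. This is a sketch under assumptions: the sizes are made up, the inputs are random, and a simple mean over time steps stands in for the two LSTM layers and dense layers, which are not reimplemented here.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of scores."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


rng = np.random.default_rng(0)
max_len, embed_dim, vocab_size = 6, 4, 8  # illustrative sizes

# Random stand-ins for the outputs of the image and language branches.
image_seq = rng.normal(size=(max_len, embed_dim))
caption_seq = rng.normal(size=(max_len, embed_dim))

# Concatenate along the feature axis: each time step now carries both
# image and caption information.
merged = np.concatenate([image_seq, caption_seq], axis=-1)  # (max_len, 2 * embed_dim)

# Placeholder for the LSTM + dense stack: collapse time, then project to the vocabulary.
summary = merged.mean(axis=0)
W_out = rng.normal(size=(2 * embed_dim, vocab_size))
probs = softmax(summary @ W_out)  # one probability per vocabulary word
```

The output is a probability distribution over the vocabulary, from which the next word of the caption is chosen.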

During testing, the model looks at the objects and people present in the image, matches them to likely words, and strings together a sentence. At times the sentence does not make complete sense or contains grammatical errors. This can happen because an object in the image resembles a different object seen during training, leading to misclassification, or because there were not enough sentences for the network to learn sentence formation, leading to broken sentences. Overall, the model performs well and captures quite relevant information.
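
One common way to string words into a sentence at test time is greedy decoding: repeatedly pick the most probable next word until an end token appears. The sketch below assumes this strategy, which the notebook may or may not use; `greedy_caption`, `toy_predict`, and the toy vocabulary are hypothetical, and the toy predictor replaces the trained model.

```python
def greedy_caption(predict_next, idx_to_word, start_idx, end_idx, max_len):
    """Build a caption by always taking the most probable next word."""
    sequence = [start_idx]
    while len(sequence) < max_len:
        probs = predict_next(sequence)
        next_idx = max(range(len(probs)), key=probs.__getitem__)
        if next_idx == end_idx:
            break
        sequence.append(next_idx)
    # Drop the start token when rendering the sentence.
    return " ".join(idx_to_word[i] for i in sequence[1:])


# Toy vocabulary and a scripted predictor standing in for the trained model.
idx_to_word = {1: "<end>", 2: "<start>", 3: "a", 4: "dog", 5: "runs"}
script = [3, 4, 5, 1]


def toy_predict(sequence):
    """Return a distribution that puts all mass on the next scripted word."""
    probs = [0.0] * 6
    probs[script[len(sequence) - 1]] = 1.0
    return probs


caption = greedy_caption(toy_predict, idx_to_word, start_idx=2, end_idx=1, max_len=10)
print(caption)
```

Because each step commits to a single word, one early misclassification carries through the rest of the sentence, which is consistent with the broken captions described above.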

Some output screenshots

While the outputs for most of the images turned out well, some did not make sense. As explained above, this is mainly because some images closely resemble others and because the number of captions is limited.

Contributions

Do not hesitate to contribute by filing an issue or a PR!
