Image Caption Generating Deep Learning Model

Aishwarya Maroju; Sneha Sri Doma; Lahari Chandarlapati

doi:10.17577/IJERTV10IS090120

Volume 10, Issue 09 (September 2021)

Open Access
Published (First Online): 04-10-2021
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License


Aishwarya Maroju#1, Sneha Sri Doma#2, Lahari Chandarlapati#3

Department of Electronics and Computer Engineering, J.N.T.U., Hyderabad,

Sreenidhi Institute of Science and Technology, Ghatkesar, Yamnampet, Hyderabad, India

Abstract: Image captioning is the process of generating a textual description of what is happening in an image. It is useful in many applications, such as analyzing large collections of unlabeled images to find hidden patterns for machine learning applications, guiding self-driving cars, and building software that assists blind people. Image captioning can be performed with deep learning models, and advances in deep learning and natural language processing have made it much easier to generate captions for a given image. In this paper we use neural networks for image captioning. A convolutional neural network (ResNet) is used as the encoder, which extracts the image features, and a recurrent neural network (Long Short Term Memory) is used as the decoder, which generates the caption from the image features and a vocabulary built from the training captions.

Keywords: Deep Learning; Image Captioning; Convolutional Neural Networks; Recurrent Neural Networks; ResNet; Long Short Term Memory

INTRODUCTION

In earlier days image captioning was a tough task, and the captions generated for a given image were not very relevant. With the advancement of deep learning neural networks and of text processing techniques from natural language processing, many tasks that were challenging with traditional machine learning have become easier to implement. These techniques are useful in image recognition, image classification, image captioning, and many other artificial intelligence applications. Image captioning means generating a description of what is happening in a given input image: the model takes an image as input and produces a caption for it. As the technology advances, the efficiency of image caption generation is also increasing. Image captioning is useful for many applications, such as self-driving cars, which are currently attracting wide attention, and it can also be used in machine learning tasks for recommendation systems. Many models have been proposed for image captioning, such as object detection based models, visual attention based image captioning, and image captioning using deep learning. Within deep learning there are again different models, such as the Inception model, the VGG model, the ResNet-LSTM model, and the traditional CNN-RNN model. In this paper we explain the model we followed for captioning images, the ResNet-LSTM model.
LITERATURE SURVEY

Liu et al. [1] describe two deep learning frameworks for image captioning: one based on a convolutional neural network with a recurrent neural network (CNN-RNN), and one based on two convolutional neural networks (CNN-CNN). In the CNN-RNN framework, a CNN is used for encoding and an RNN for decoding. The CNN converts each image into a vector, called the image features, which is passed to the RNN as input. In the RNN stage, NLTK libraries are used to obtain the actual captions. In the CNN-CNN framework, CNNs are used for both encoding and decoding. A vocabulary dictionary is mapped against the image features, using the NLTK library, to find the matching words for the given image and so produce the caption. Because convolutions can be computed in parallel, training the CNN-CNN model is faster than training a recurrent model, which must process the sequence step by step. The CNN-CNN model therefore has a shorter training time, while the CNN-RNN model takes longer to train because it is sequential but reaches a lower loss.

Hani et al. [2] use an encoder-decoder model for image captioning. They also discuss two other approaches: retrieval-based captioning and template-based captioning. In retrieval-based captioning, the training images are placed in one space and their corresponding captions in another. For a test image, the correlation with each caption is calculated in the caption space, and the caption with the highest correlation is retrieved from the caption dictionary as the caption for that image. Template-based (prototype-based) captioning is the second technique they discuss. In their own model they use Inception V3 as the encoder, and an attention mechanism with a GRU as the decoder, to generate the captions.
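
The retrieval step described for [2] can be illustrated with a small sketch. It is not the implementation from that paper: cosine similarity stands in for the "correlation", and the function names, the three-dimensional embeddings, and the example captions are all hypothetical values chosen for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve_caption(image_vec, caption_bank):
    """Return the stored caption whose embedding best matches the image."""
    best_caption, best_score = None, -2.0  # cosine is never below -1
    for caption, caption_vec in caption_bank:
        score = cosine_similarity(image_vec, caption_vec)
        if score > best_score:
            best_caption, best_score = caption, score
    return best_caption, best_score

# Toy, hand-made embeddings (hypothetical values, not learned features).
caption_bank = [
    ("a dog runs on grass", [0.9, 0.1, 0.0]),
    ("a child plays on a beach", [0.1, 0.8, 0.3]),
    ("a car is parked on a street", [0.0, 0.2, 0.9]),
]
test_image_vec = [0.8, 0.2, 0.1]
caption, score = retrieve_caption(test_image_vec, caption_bank)
print(caption, round(score, 3))
```

In a real system the image and caption vectors would come from trained encoders that map both into a shared space; here they are fixed by hand so that the ranking logic is easy to follow.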

Das, Jain et al. [3] study how deep learning models can be used for military image captioning. Their approach is mainly based on the CNN-RNN framework. They use the Inception model to encode the images and, to reduce the vanishing gradient problem, they use Long Short Term Memory (LSTM) networks.

Geetha et al. [4] use a CNN-LSTM model for image captioning. They explain the entire flow of the model, from data set collection to caption generation. A convolutional neural network is used as the encoder and LSTMs are used as the decoder for generating the captions.
DISADVANTAGES OF EXISTING MODEL

As seen in the literature survey, each existing model has its own drawbacks, which make it less efficient and less accurate. The drawbacks observed in the existing models are as follows:
1. In the CNN-CNN model, where a CNN is used for both encoding and decoding, the loss is high. This is not acceptable because the generated captions will not be accurate and may be irrelevant to the given test image.
2. The CNN-RNN model has a lower loss than the CNN-CNN model, but its training time is longer, which affects the overall efficiency of the model. It also suffers from another problem, the vanishing gradient problem. The gradient measures how much the loss changes for a small change in a given parameter.
The vanishing gradient problem occurs mainly in deep artificial neural networks and in recurrent neural networks. The gradient is the ratio of the change in the error at the output of the network to the change in a weight, that is, the slope of the loss with respect to that weight. When the slope is large, the weights are updated in larger steps and the model learns faster. As the number of hidden layers (or, in an RNN, time steps) increases, the gradient propagated back to the earlier layers shrinks and eventually approaches zero. This hinders the learning of long-term dependencies in recurrent neural networks, because information about earlier words cannot be retained in the hidden state for long-term use. Hence it becomes hard for the RNN to learn from long captions during training: as the gradient approaches zero, the contribution of key words that appear early in the caption is effectively lost. The CNN-RNN model therefore cannot be trained efficiently to generate captions for images. We conclude that, because RNNs suffer from the vanishing gradient problem, generating captions with the CNN-RNN model is neither efficient nor accurate.
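
The shrinking gradient can be seen numerically in a minimal sketch. It assumes a scalar tanh RNN, h_t = tanh(w * h_{t-1} + u * x_t), which is a simplification and not the model used in this paper; the weights w and u and the constant input are hypothetical values chosen for illustration.

```python
import math

def gradient_through_time(w, u, inputs, h0=0.0):
    """Return |d h_t / d h_0| for t = 1..T in a scalar tanh RNN."""
    h = h0
    grad = 1.0
    magnitudes = []
    for x in inputs:
        h = math.tanh(w * h + u * x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
        magnitudes.append(abs(grad))
    return magnitudes

# Hypothetical weights and a constant input, for illustration only.
mags = gradient_through_time(w=0.5, u=1.0, inputs=[1.0] * 30)
for t in (1, 10, 20, 30):
    print(f"step {t:2d}: |dh_t/dh_0| = {mags[t - 1]:.3e}")
```

Each step multiplies the gradient by a factor whose magnitude is below 1, so the influence of the first input on the state 30 steps later becomes negligible. This is the effect that makes early words in a long caption hard for a plain RNN to learn from.
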
PROPOSED MODEL

As observed above, the traditional CNN-RNN model suffers from the vanishing gradient problem, which prevents the recurrent neural network from learning and being trained efficiently. To reduce this problem, we propose the following model, which aims to increase both the efficiency of caption generation and the accuracy of the captions. The architecture of the proposed model is given below [Figure 1].
Fig 1: Architecture of ResNet-LSTM Model

In this paper, we explain the ResNet-LSTM model for image captioning. The ResNet architecture is used for encoding and LSTMs are used for decoding. When an image is passed to ResNet (Residual Neural Network), it extracts the image features. A vocabulary is built from the training captions, and the model is then trained with these two inputs. After training, we test the model. The flow diagram of the proposed model is given below [Figure 2].
Fig 2: Model Implementation Flow Chart
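
The vocabulary-building step in this flow can be sketched as follows. This is a minimal illustration and not the authors' code: the special tokens `<pad>`, `<start>`, `<end>`, and `<unk>`, the function names, whitespace tokenization, and the two example captions are assumptions made for the sketch.

```python
from collections import Counter

# Special tokens are an assumption of this sketch, not taken from the paper.
PAD, START, END, UNK = "<pad>", "<start>", "<end>", "<unk>"

def build_vocabulary(captions, min_count=1):
    """Map every sufficiently frequent training word to an integer index."""
    counts = Counter(word for c in captions for word in c.lower().split())
    words = sorted(w for w, n in counts.items() if n >= min_count)
    index_to_word = [PAD, START, END, UNK] + words
    word_to_index = {w: i for i, w in enumerate(index_to_word)}
    return word_to_index, index_to_word

def encode_caption(caption, word_to_index):
    """Wrap a caption in start/end tokens and convert it to indices."""
    tokens = [START] + caption.lower().split() + [END]
    unk = word_to_index[UNK]
    return [word_to_index.get(t, unk) for t in tokens]

training_captions = [
    "a dog runs on the grass",
    "a child plays on the beach",
]
word_to_index, index_to_word = build_vocabulary(training_captions)
encoded = encode_caption("a dog plays on the sand", word_to_index)
print(encoded)  # prints [1, 4, 7, 10, 9, 12, 3, 2]
```

The word "sand" does not occur in the training captions, so it is mapped to the index of `<unk>`. Raising `min_count` is a common way to keep rare words out of the vocabulary.
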
The output of ResNet (the image feature vector) and the vocabulary built from the training captions are passed to the Long Short Term Memory network to generate captions. Given the image feature vector and the vocabulary as input, the first LSTM step generates the first word of the caption using what was learned in training. Each subsequent word is generated from the image feature vector and the previously generated words. Finally, all these words are concatenated to form the caption for the given image.
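
The word-by-word generation described above can be sketched as a greedy decoding loop. The sketch assumes a `predict_next` function that returns a probability for each candidate word; that function, the `<start>` and `<end>` tokens, and the lookup table standing in for a trained LSTM are all hypothetical.

```python
def greedy_caption(image_features, predict_next, max_len=20,
                   start="<start>", end="<end>"):
    """Generate a caption one word at a time, always taking the top word."""
    words = [start]
    for _ in range(max_len):
        # predict_next returns {word: probability} given the image
        # features and the words generated so far.
        probs = predict_next(image_features, words)
        next_word = max(probs, key=probs.get)
        if next_word == end:
            break
        words.append(next_word)
    return " ".join(words[1:])

# Hypothetical stand-in for a trained decoder: a fixed lookup table keyed
# by the previous word. A real decoder would also use the image features.
TOY_TABLE = {
    "<start>": {"a": 0.9, "dog": 0.1},
    "a": {"dog": 0.8, "<end>": 0.2},
    "dog": {"runs": 0.7, "<end>": 0.3},
    "runs": {"<end>": 0.95, "a": 0.05},
}

def toy_predict_next(image_features, words):
    return TOY_TABLE[words[-1]]

print(greedy_caption([0.1, 0.2], toy_predict_next))  # prints: a dog runs
```

Greedy selection is only one possible decoding strategy; the paper does not state which one was used, and alternatives such as beam search keep several candidate captions at each step.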

Long Short Term Memory cells are advanced RNN units that can remember data over long periods. Long Short Term Memory networks can largely overcome the vanishing gradient problem that exists in recurrent neural networks. Traditional RNNs cannot remember long sequences of data because of the vanishing gradient problem, so in caption generation they cannot retain important words that were generated earlier and are needed to generate later words. For example, consider predicting the last word of the text "I am from France. I speak very fluently in French." To predict "French" it is important to remember the earlier word "France", which is not possible with a traditional RNN, but Long Short Term Memory networks do not have this issue. LSTMs are therefore preferred over traditional RNNs for caption generation.

Long Short Term Memory networks have cells that consist of several gates: the input gate, the forget gate, and the output gate [Figure 4].
Fig 4: LSTM Cell Architecture

In the figure, $x_t$ is the input to the cell, $h_{t-1}$ is the output remembered from the previous cell, and $h_t$ is the output of the present cell. The first step in the LSTM is deciding what to forget, which is decided by a sigmoid function. It takes $h_{t-1}$ and $x_t$ as inputs and gives a value between 0 and 1 for each element of the cell state, where 1 means keep it as it is and 0 means throw it away completely. This is represented by the following equation, where $f_t$ is the forget gate and $\sigma$ is the sigmoid function:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

After deciding what to forget using the forget gate, we decide what new information is to be stored in the cell state $C_t$ for processing long sequences. This has two parts. First, a sigmoid ($\sigma$) layer, the input gate $i_t$, decides which values need to be modified. Second, a tanh layer creates a vector of new candidate values, $\tilde{C}_t$, that can be added to the cell state. The cell state carries information from one cell to the next. These steps are represented by the following formulas:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\tilde{C}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$$

Then we update the cell state using the formula:

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$

Finally, the output is computed by the following equations:

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t * \tanh(C_t)$$

In this way, during training the captions are processed by the LSTM one word at a time, the state produced at each cell is passed on to the next cell, and finally the generated words are concatenated to form the caption for the given image.
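
As a rough sketch of how the gate equations above combine in one time step, the following pure-Python function computes the forget gate, input gate, candidate values, new cell state, output gate, and output in that order. It is an illustration only: the `params` dictionary layout, the helper names, the hidden size of 2, and the constant weights are assumptions, and a practical model would use a deep learning library instead.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W @ v + b for a matrix W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias
            for row, bias in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step: returns the new output h_t and cell state c_t."""
    z = h_prev + x_t  # list concatenation, i.e. [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(params["W_f"], params["b_f"], z)]
    i_t = [sigmoid(a) for a in affine(params["W_i"], params["b_i"], z)]
    c_hat = [math.tanh(a) for a in affine(params["W_c"], params["b_c"], z)]
    # Keep part of the old state and add part of the candidate values.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    o_t = [sigmoid(a) for a in affine(params["W_o"], params["b_o"], z)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

# Hypothetical sizes and constant weights, for illustration only.
hidden, n_in = 2, 1
params = {name: [[0.1] * (hidden + n_in) for _ in range(hidden)]
          for name in ("W_f", "W_i", "W_c", "W_o")}
params.update({name: [0.0] * hidden
               for name in ("b_f", "b_i", "b_c", "b_o")})

h, c = [0.0] * hidden, [0.0] * hidden
for x in ([1.0], [0.5], [-1.0]):
    h, c = lstm_step(x, h, c, params)
    print([round(v, 4) for v in h], [round(v, 4) for v in c])
```

The cell state is updated additively, scaled by the forget gate. That additive path lets information and gradients survive across many steps, which is why the LSTM is less affected by the vanishing gradient problem than a plain RNN.
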
RESULT ANALYSIS

After defining and fitting the model, we trained it for 50 epochs. During the initial epochs of training the accuracy is very low and the generated captions are not closely related to the given test images. After at least 20 epochs of training, the generated captions are somewhat related to the test images. After 50 epochs, the accuracy of the model increases and the generated captions are closely related to the test images, as shown in the following figures [Figure 5] [Figure 6].
Fig 5: Caption generated for given test image

Fig 6: Caption generated for given test image
CONCLUSION

An image captioning deep learning model is proposed in this paper. We used the ResNet-LSTM model to generate a caption for each given image. The Flickr 8k data set was used to train the model. ResNet is a convolutional neural network architecture. It is used to extract the image features, which are given as input to the Long Short Term Memory units, and captions are generated with the help of the vocabulary built during training. We conclude that this ResNet-LSTM model has higher accuracy than the CNN-RNN and VGG models. The model runs efficiently when a Graphics Processing Unit is used. This image captioning model is useful for analyzing large amounts of unstructured and unlabeled image data to find patterns, for guiding self-driving cars, and for building software to guide blind people.

FUTURE SCOPE

In this paper we have explained how to generate captions for images. Although deep learning has advanced considerably, exact caption generation is still not possible, for reasons such as hardware requirements and the lack of a programming logic or model that can generate exact captions, since machines cannot think or make decisions as accurately as humans do. With future advances in hardware and deep learning models, we hope to generate captions with higher accuracy. We also plan to extend this model into a complete image-to-speech system by converting the captions of images to speech, which would be very helpful for blind people.

REFERENCES
1. Liu, Shuang & Bai, Liang & Hu, Yanli & Wang, Haoran. (2018). Image Captioning Based on Deep Neural Networks. MATEC Web of Conferences. 232. 01052. 10.1051/matecconf/201823201052.
2. A. Hani, N. Tagougui and M. Kherallah, "Image Caption Generation Using A Deep Architecture," 2019 International Arab Conference on Information Technology (ACIT), 2019, pp. 246-251, doi: 10.1109/ACIT47987.2019.8990998.
3. S. Das, L. Jain and A. Das, "Deep Learning for Military Image Captioning," 2018 21st International Conference on Information Fusion (FUSION), 2018, pp. 2165-2171, doi: 10.23919/ICIF.2018.8455321.
4. G. Geetha, T. Kirthigadevi, G. Godwin Ponsam, T. Karthik and M. Safa, "Image Captioning Using Deep Convolutional Neural Networks (CNNs)," Journal of Physics: Conference Series, Volume 1712, International Conference on Computational Physics in Emerging Technologies (ICCPET) 2020, August 2020, Mangalore, India. IOP Publishing Ltd.
5. Simonyan, Karen, and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
6. Donahue, Jeffrey, et al. Long-term recurrent convolutional networks for visual recognition and description. Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
7. Lu, Jiasen, et al. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Vol. 6. 2017.
8. Ordonez, Vicente, Girish Kulkarni, and Tamara L. Berg. Im2text: Describing images using 1 million captioned photographs. Advances in neural information processing systems. 2011.
9. Chen, Xinlei, and C. Lawrence Zitnick. Mind's eye: A recurrent visual representation for image caption generation. Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
10. Feng, Yansong, and Mirella Lapata. How many words is a picture worth? Automatic caption generation for news images. Proceedings of the 48th annual meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2010.
11. Rashtchian, Cyrus, et al. Collecting image annotations using Amazon's Mechanical Turk. Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk. Association for Computational Linguistics, 2010.
