Image Captioning System using Recurrent Neural Network-LSTM

Alok Mathur

doi:10.17577/IJERTV11IS020129

Volume 11, Issue 02 (February 2022)

Image Captioning System using Recurrent Neural Network-LSTM

DOI : 10.17577/IJERTV11IS020129

Download Full-Text PDF Cite this Publication

Open Access
Article Download / Views: 1,485
Authors : Alok Mathur
Paper ID : IJERTV11IS020129
Volume & Issue : Volume 11, Issue 02 (February 2022)
Published (First Online): 25-02-2022
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License

PDF Version

View

Text Only Version

Image Captioning System using Recurrent Neural Network-LSTM

Alok Mathur

B.Tech CSE Student, VIT, Vellore Pune, India

Abstract Image captioning may benefit the area of retrieval, by allowing us to sort and request pictorial or image- based content in new ways. Mapping the space between images and language, may resonate with some deeper vein of progress which, once unearthed, could potentially lead to more sophisticated machines. Generating captions for images is a vital task relevant to the area of both Computer Vision and Natural Language Processing. The main challenge of this task is to capture how objects relate to each other in the image and to express them in a natural language (like English). Image captioning has a variety of uses, including editing software recommendations, virtual assistants, image indexing, accessibility for visually impaired people, social media, and other natural language processing applications. We aim to achieve this task by employing a model based on recent advances in computer vision and machine translation.

KeywordsImage Captioning, Deep Learning, Neural Network

INTRODUCTION

Image captioning is a branch of artificial intelligence research that focuses on recognising natural scenes and explaining them in natural language. Image captioning could help with retrieval by allowing us to organise and request pictorial or image-based content in new ways. Mapping the space between visuals and text could be a sign of deeper progress. Creating captions for images is an important task in both the fields of computer vision and natural language processing. The major goal of this work is to capture how items in the image interact with one another and represent them in plain language (like English).

To create lexically rich text descriptions for photos, neural networks can be employed. We've added an attention mechanism that can recognise what a word in an image refers to. This will be an effective tool for utilising large amounts of unformatted image data.
LITERATURE REVIEW
1. Image Captioning based on Deep Reinforcement Learning
  
  This paper proposes a model in contrast to existing sequential models such as Recurrent Neural Networks (RNNs). It utilizes two networks called policy network and value network to collaboratively generate the captions of images. Based on reinforcement learning, the policy network gives some possible actions about the object. Then, the value network makes decisions on whether to choose the action given by the policy network based on the evaluated reward scores. The policy network consists of two networks, a Convolutional Neural Network (CNN) and a Recurrent Neural
  
  Network (RNN), which provides a probability for the agent to take actions at each state. The value 9 network contains three parts, a CNN, a RNN and a Linear Mapping Layer, in which a perceptron model is utilized to evaluate the predictions to choose the most suitable Image Captioning based on Deep Reinforcement Learning
2. Image-Text Surgery: Efficient Concept Learning in Image Captioning by Generating Pseudopairs
Image captioning aims to generate natural language sentences to describe the salient parts of a given image. Although neural networks have recently achieved promising results, a key problem is that they can only describe concepts seen in the training image- sentence pairs. Efficient learning of novel concepts has thus been a topic of recent interest to alleviate the expensive manpower of labeling data. Image-Text Surgery, to synthesize pseudo image-sentence pairs is proposed. The pseudo pairs are generated under the guidance of a knowledge base, with syntax from a seed data set (i.e., MSCOCO) and visual information from an existing large- scale image base (i.e., ImageNet). Via pseudo data, the captioning model learns novel concepts without any corresponding human-labelled pairs. adaptive visual replacement is introduced, which adaptively filters unnecessary visual features in pseudo data with an attention mechanism. approach on a held-out subset of the MSCOCO data set is evaluated. The experimental results demonstrate that the proposed approach provides significant performance improvements over state-of-the- art methods in terms of F1 score and sentence quality. An ablation study and the qualitative results further validate the effectiveness of the approach.
[3 ] Boosting Image Captioning with Attributes Automatically describing an image with a natural language has been an emerging challenge in both fields of computer vision and natural language processing. it presents Long Short Term Memory with Attributes (LSTM-

A) – a novel architecture that integrates attributes into the successful Convolutional Neural Networks (CNNs) plus Recurrent Neural Networks (RNNs) image captioning framework, by training them in an end-to-end manner. Particularly, the learning of attributes is strengthened by integrating inter-attribute correlations into Multiple Instance Learning (MIL). To incorporate attributes into 13 captioning, it constructs variants of architectures by feeding image representations and attributes into RNNs in different ways to explore the mutual but also fuzzy

relationship between them. Extensive experiments are conducted on COCO image captioning dataset and their framework shows clear improvements when compared to state-of-the-art deep models. More remarkably, it obtains METEOR/CIDEr-D of 25.5%/100.2% on testing data of widely used and publicly available splits in when extracting image representations by GoogleNet and achieve superior performance on COCO captioning Leaderboard.
1. Image Captioning with Scene-graph Based Semantic Concepts
  
  Different from existing approaches for image captioning, in this paper, they explore the cooccurrence dependency of high-level semantic concepts and propose a novel method with scene-graph based semantic representation for image captioning. To embed scene graph as an intermediate state, the task of image captioning into two phases, called concept cognition and sentence construction respectively. a vocabulary of semantic concepts is built and propose a CNN-RNN-SVM framework to generate the scene- graph- based sequence, which is then transformed into a bit vector, as the input of RNN in the next phase. They evaluate their method on FLICR dataset. Experimental results show that the approaches obtain a competitive or superior result to the state-of the-arts.
2. Image Captioning with Word Level Attention
  
  Image captioning is an attractive and challenging task to perform automatic image description and a number of works are designed for this task. these researches are based on convolutional neural network (CNN) and recurrent neural network (RNN), where the primary input to language model for word prediction at the current time step is usually the linguistic word generated at the previous time step. In this work, a novel word level attention layer is designed to process image features with two modules for accurate word prediction. The first is a bidirectional spatial embedding module to handle feature maps, 14 then the second module employs attention mechanism to extract word level attention which will be fed into language model. The experimental results on the benchmark MSCOCO dataset demonstrate that the proposed model achieves the state-of-the-art performances with 106.0 on CIDEr and 34.0 on B-4.
3. A parallel-fusion RNN-LSTM architecture for image caption generation
  
  The models based on deep convolutional networks and recurrent neural networks have dominated in recent image captiongeneration tasks. Performance and complexity are still eternal topic. By combining the advantages of simple RNN and LSTM, a novel parallel fusion RNN-LSTM architecture is presented, which obtains better results than a dominated one and improves the efficiency as well. The proposed approach divides the hidden units of RNN into several same-size parts, and lets them work in parallel. Then, their outputs are merged with corresponding ratios to generate final results. Moreover, these units can be different types of RNNs, for instance, a simple RNN and a
  
  LSTM. By training normally using Neural Talk platform on Flickr8k dataset, without additional training data, better results are obtained than that of dominated structure and particularly, the proposed model surpass Google NIC in image caption generation.
4. Image Captioning With Visual-Semantic Double Attention
  
  A novel Visual-Semantic Double Attention (VSDA) model is proposed for image captioning. In their project, VSDA consists of two parts: a modified visual attention model is used to extract sub-region image features, then a new Semantic Attention (SEA) model is proposed to distill semantic features. Traditional attribute-based models always neglect the distinctive importance of each attribute word and fuse all of them into recurrent neural networks, resulting in abundant irrelevant semantic features. In contrast, at each 16 timestep, their model selects the most relevant word that aligns with current context. In other words, the real power of VSDA lies in the ability of not only leveraging semantic features but also eliminating the influence of irrelevant attribute words to make the semantic guidance more precise. Furthermore, their approach solves the problem that visual attention models cannot boost generating non-visual words. Considering that visual and semantic features are complementary to each other, their model can leverage both of them to strengthen the generations of visual and non-visual words. Extensive experiments are conducted on famous datasets: MS COCO and Flickr30k. The results show that VSDA outperforms other methods and achieves promising performance.
5. Image Captioning with Affective Guiding and Selective Attention
  
  Image captioning is an increasingly important problem associated with artificial intelligence, computer vision, and natural language processing. In this article, an image captioning model with Affective Guiding and Selective Attention Mechanism named AGSAM is proposed. In this approach it is aimed to bridge the affective gap between image captioning and the emotional response elicited by the image. First, an effective component that captures higher-level concepts encoded in images into AG-SAM is introduced. Hence, this proposed language model can be adapted to generate sentences that are more passionate and emotive. In addition, a selective gate acting on the attention mechanism controls the degree of how much visual information AG-SAM needs. Experimental results have shown that this model outperforms most existing methods, clearly reflecting an association between images and emotional components that is usually ignored in existing works.
6. Image Captioning using Deep Neural Architectures Automatically creating the description of an image using any natural language sentences is a very challenging task. It requires expertise in both image processing as well as natural language processing. This paper discusses different available models for image captioning tasks. The
  
  advancement in the task of object recognition and machine translation has greatly improved the performance of image captioning models in recent years has been discussed. In addition to that it's been discussed how this model can be implemented. At the end, evaluation of the performance of the model using standard evaluation matrices is done.
7. Deep Reinforcement Learning-Based Image Captioning with Embedding Reward
Image captioning is a challenging problem owing to the complexity in understanding the image content and diverse ways of describing it in natural language. Recent advances in deep neural networks have substantially improved the performance of this task. Most state of-the-art approaches follow an encoder-decoder framework, which generates captions using a sequential recurrent prediction model. However, in this paper, a unique decision making framework for image captioning is introduced by providing the confidence of predicting the next word according to the 18 current states. Additionally, the value network serves as a global and lookahead guidance by evaluating all possible extensions of the current state. In essence, it adjusts the goal of predicting the correct words towards the goal of generating captions similar to the ground truth captions. Both networks are trained using an actor-critic reinforcement learning model, with a novel reward defined by visual-semantic embedding. Extensive experiments and analyses on the Microsoft COCO dataset show that the proposed framework outperforms state- induced. A policy network and a value network is utilized to collaboratively generate captions. The policy network serves as a local gf-the-art approach across different evaluation metrics.
GAPS AND ISSUES The methods have three main issues:
1. They are trained using maximum likelihood estimation and back- propagation approaches. In this case, the next word is predicted given the image and all the previously generated ground truth words. Therefore, the generated captions look- like ground-truth captions. This phenomenon is called the exposure bias problem.
2. Evaluation metrics at test time are non-differentiable. Ideally sequence models for image captioning should be trained to avoid exposure bias and directly optimize metrics for the test time. In the actor-critic-based reinforcement learning algorithm, critique can be used in estimating the expected future reward to train the actor.
3. A CNN and RNN based combined network generates captions. We have also included LSTM to enhance the prediction accuracy of our model. LSTMs provide us with a large range of parameters such as learning rates, and input and output biases. Hence, no need for fine adjustments. The complexity to update each weight is reduced to O(1) with LSTMs, similar to that of Back Propagation Through Time (BPTT), which is an advantage of our proposed method.
PROPOSED WORK

Fig 4.1: Showing the brief architecture which contains the high level sub-modules

Fig 4.2: Showing the detailed architecture of our proposed

model
Image captioning can be divided into three components. The first component includes cleaning the data which consists of captions from the training dataset. Then Start and End tokens are added to the start and end of each caption. The second component includes data processing. Encoding of captions is done by indexing every word to a number and vice-versa. Images are converted to their corresponding feature vectors. Generator function is used to train and test the model wherein the two images and their captions will be used to train and one image to test the model. The third component includes the Decoder which is used in prediction. For prediction, an image vector is provided along with the start token. The model then predicts the next word using LSTM. Then the model uses the predicted words to predict further. This process continues until the end token is predicted.
IMPLEMENTATION

There are many steps that need to be done before the model is trained to predict the captions based on the images.

5.1 Data Collection:

There are many open source datasets available like Flickr 8k (containing 8k images), Flickr 30k (containing 30k images), MS COCO (containing 180k images), etc. The Flickr 8k dataset is used in this project. This dataset contains 8000 images each with 5 captions.
RESULTS AND DISCUSSION

As we generate a caption, word by word, you can see the model's gaze shifting across the image. This is possible because of its Attention mechanism, which allows it to focus on the part of the image most relevant to the word it is going to utter next. Here are some captions generated on test images not seen during training or validation. These images are a part of the Flickr8K dataset. To understand how good the model is, lets try to generate captions on images from the test dataset (i.e. the images which the model did not see during the training).

Figure 1:- Output 1

Figure 2 :- Output 2
NOVELTY

We present a model that generates natural language descriptions of images and their regions. Our approach leverages datasets of images and their sentence descriptions to learn bout the inter-modal correspondences between language and visual data. Our alignment model is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Multimodal Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions. Using one of

the pre-trained models from Keras application APIs, We have done Transfer Learning which gives quite impressive results. Machine learnings potential is often stymied by its heavy reliance on large amounts of high-quality data (for supervised learning this data must be well-labeled to boot). Unfortunately, these sorts of data sets are increasingly proprietary or prohibitively expensive to access and thats when the necessary data exists at all. Transfer learning allows developers to circumvent the need for lots of new data. A model that has already been trained on a task for which labeled training data is plentiful will be able to handle a new but similar task with far less data. Using a pre- trained model often speeds up the process of training the model on a new task, and can also result in a more accurate and effective model overall. When we augment the knowledge of an isolated learner (also known as an ignorant learner) with knowledge from a source model, the baseline performance might improve due to this knowledge transfer. Utilizing knowledge from a source model might also help in fully learning the target task, as compared to a target model that learns from scratch. This, in turn, results in improvements in the overall time taken to develop/learn a model.
CONCLUSION AND FUTURE WORK Automatic image captioning is a relatively new task, thanks to the efforts made by researchers in this field, great progress has been made. In our opinion there is still much room to improve the performance of image captioning. First, with the fast development of deep neural networks, employing more powerful network structures as language models and/or visual models will undoubtedly improve the performance of image description generation. Second, because images are consisted of objects distributed in space, while image captions are sequences of words, investigation on presence and order of visual concepts in image captions are important for image captioning. Furthermore, since this problem fits well with the attention mechanism and attention mechanism is suggested to run the range of AI related tasks, how to utilize attention mechanism to generate image cations effectively will continue to be an important research topic. Third, due to the lack of paired image-sentence training set, research on utilizing unsupervised data, either from images alone or text alone, to improve image captioning will be promising. Fourth, current approaches mainly focus on generating captions that are general about image contents. However, to describe images at a human level and to be applicable in real- life environments, image description should be well grounded by the elements of the images. Therefore, image captioning grounded by image regions will be one of the future research directions. Fifth, so far, most of the previous methods are designed to image captioning for generic cases, while task- specific image captioning is needed in certain cases. Research on solving image captioning problems in various special cases will also be interesting. As a future scope, using a larger dataset would be a better option. Changing the model architecture, e.g. include an attention module could be a better solution for image captioning process. Doing more hyper parameter tuning (learning rate, batch size, number of layers, number of units, dropout rate, batch normalization etc.) and using the cross validation set to understand

overfitting can be beneficial. By using Beam Search instead of Greedy Search during Inference and by using BLEU Score to evaluate and measure the performance of the model could give more accurate results.

ACKNOWLEDGMENT

The authors are thankful to VIT, Vellore for giving opportunity to write about such a great and interesting topic.

REFERENCES

Shi, H., Li, P., Wang, B., & Wang, Z. (2018, August). Image captioning based on deep reinforcement learning. In Proceedings of the 10th International Conference on Internet Multimedia Computing and Service (pp. 1-5).
Fu, K., Li, J., Jin, J., & Zhang, C. (2018). Image-text surgery: Efficient concept learning in image captioning by generating pseudopairs. IEEE transactions on neural networks and learning systems, 29(12), 5910-5921.
Yao, T., Pan, Y., Li, Y., Qiu, Z., & Mei, T. (2017). Boosting image captioning with attributes. In Proceedings of the IEEE International Conference on Computer Vision (pp. 4894-4902).
Gao, L., Wang, B., & Wang, W. (2018, February). Image captioning with scene-graph based semantic concepts. In Proceedings of the 2018 10th International Conference on Machine Learning and Computing (pp. 225-229).
Fang, F., Wang, H., & Tang, P. (2018, October). Image captioning with word level attention. In 2018 25th IEEE International Conference on Image Processing (ICIP) (pp. 1278-1282). IEEE.
Wang, M., Song, L., Yang, X., & Luo, C. (2016, September). A parallel-fusion RNN-LSTM architecture for image caption generation. In 2016 IEEE International Conference on Image Processing (ICIP) (pp. 4448-4452). IEEE.
He, C., & Hu, H. (2019). Image captioning with visual-semantic double attention. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 15(1), 1- 16.
Wang, A., Hu, H., & Yang, L. (2018). Image captioning with Affective guiding and selective attention. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 14(3), 1-15.
Shah, P., Bakrola, V., & Pati, S. (2017, March). Image captioning using deep neural architectures. In 2017 International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS) (pp. 1-4). IEEE.
Ren, Z., Wang, X., Zhang, N., Lv, X., & Li, L. J. (2017). Deep reinforcement learning-based image captioning with embedding reward. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 290-298).
Karpathy, A., & Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3128-3137).
Hossain, M. Z., Sohel, F., Shiratuddin, M. F., & Laga, H. (2019). A comprehensive survey of deep learning for image captioning. ACM Computing Surveys (CsUR), 51(6), 1-36.

Image Captioning System using Recurrent Neural Network-LSTM

Leave a Reply