Image Captioning using Luong Attention and SentencePiece Tokenizer
In this blog post, I will explain the Image Captioning project I completed as part of the Data Mining course (CSE 5334) during my MS in Computer Science at the University of Texas at Arlington.
Introduction
The aim of the project was to create an image captioning system that takes in an image and predicts a caption for it.
- The HTML version of the Jupyter Notebook can be accessed here.
- The YouTube link for the demo can be found here.
Dataset Collection and Pre-processing
The dataset used for training the model was obtained from https://cocodataset.org/#home. The 2014 version of the dataset was used, which contains approximately 82000 images, each paired with five different captions. To fit the data in RAM, a subset of the dataset was used, with a buffer size of 10000 images. Pixel values were pre-processed to the [-1, 1] range, and image features were extracted using the Inception V3 model [1]; for each image, the extracted feature tensor had a shape of (1, 64, 2048). Captions were pre-processed to remove unwanted HTML markup, convert all characters to lowercase, and convert Unicode characters to ASCII. After splitting the dataset, the training set had 49810 image-caption pairs, while the validation and testing sets had 1311 image-caption pairs each.
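Feature extraction along these lines can be sketched in TensorFlow as follows. This is a minimal illustration, not the exact project code: the function name is my own, but the preprocessing (resize to 299x299, scale to [-1, 1]) and the (1, 64, 2048) output shape follow the description above.

```python
import tensorflow as tf

# InceptionV3 without the classification head; its final convolutional
# output is an 8x8x2048 feature map for a 299x299 input.
image_model = tf.keras.applications.InceptionV3(include_top=False, weights='imagenet')

def extract_features(image_path):
    img = tf.io.read_file(image_path)
    img = tf.io.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))
    # Scales pixel values to the [-1, 1] range expected by InceptionV3.
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    features = image_model(tf.expand_dims(img, 0))   # (1, 8, 8, 2048)
    return tf.reshape(features, (1, 64, 2048))       # (1, 64, 2048)
```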
Tokenization
The vocabulary of a natural language cannot be bounded; it keeps growing daily. Moreover, when the dataset is small, rare words cannot simply be dropped without losing information. A solution is to use a subword tokenizer that limits the vocabulary size while still retaining the content of rare words. For this purpose, I used the SentencePiece model, introduced by Kudo and Richardson [2]. SentencePiece implements subword segmentation algorithms such as byte-pair encoding, which borrows a data-compression idea of repeatedly merging the most frequent pair of symbols, and the unigram language model. In this project, I trained a unigram model on the captions from the training set with a vocabulary size of 2048.
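Training such a tokenizer can be sketched with the sentencepiece Python package as shown below; the file names are placeholders, assuming the training captions have been written one per line to a text file.

```python
import sentencepiece as spm

# Train a unigram SentencePiece model on the training captions
# ('captions.txt' is a placeholder: one caption per line).
spm.SentencePieceTrainer.train(
    input='captions.txt',
    model_prefix='caption_sp',
    vocab_size=2048,
    model_type='unigram',
)

# Load the trained model and encode/decode a caption.
sp = spm.SentencePieceProcessor(model_file='caption_sp.model')
ids = sp.encode('a man riding a surfboard on a wave', out_type=int)
print(ids)
print(sp.decode(ids))
```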
Methodology
The model was trained using an architecture similar to the Seq2Seq architecture in [3]. The Seq2Seq model consists of two sub-modules: an Encoder and a Decoder. The Encoder takes the input sequence from the source language and encodes it with a Recurrent Neural Network (RNN). The encoded representation is passed to the Decoder, which uses another RNN to decode it into the output language and produce the result.
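As a rough illustration of how this maps onto image captioning, here is a minimal encoder/decoder skeleton in TensorFlow/Keras, assuming the precomputed (64, 2048) Inception V3 features from the previous section. The class names, layer sizes, and signatures are illustrative assumptions rather than the exact project code.

```python
import tensorflow as tf

class Encoder(tf.keras.Model):
    """Projects the precomputed (64, 2048) image features into the decoder's space."""
    def __init__(self, units):
        super().__init__()
        self.fc = tf.keras.layers.Dense(units, activation='relu')

    def call(self, features):            # (batch, 64, 2048)
        return self.fc(features)         # (batch, 64, units)

class Decoder(tf.keras.Model):
    """Embeds the previous token and runs it through an LSTM to predict the next one."""
    def __init__(self, vocab_size, embedding_dim, units):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.lstm = tf.keras.layers.LSTM(units, return_sequences=True, return_state=True)
        self.fc = tf.keras.layers.Dense(vocab_size)

    def call(self, token, states=None):
        x = self.embedding(token)                           # (batch, 1, embedding_dim)
        output, h, c = self.lstm(x, initial_state=states)   # (batch, 1, units)
        logits = self.fc(output)                            # (batch, 1, vocab_size)
        return logits, [h, c]
```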

Bahdanau et al. in [5] introduced the concept of Attention to the Seq2Seq architecture, as the basic Seq2Seq model failed to work well for longer sentences (more than 40 words). The paper’s idea was that, each time the model has to predict a target-language word, it is given the most relevant parts of the input along with the previous output. The authors showed that this produced better results than the earlier Seq2Seq model on longer sentences.

Luong et al. in [7] provided a different approach to attention. The global Luong Attention model used here takes all of the encoder’s hidden states as input when deriving the context vector. The main difference from the Bahdanau Attention model lies in which decoder state is used: Bahdanau attention relies on the decoder’s hidden state from the previous time step, whereas Luong attention uses the output of the top Long Short-Term Memory (LSTM) layer, on both the encoder and decoder sides, at the current time step to calculate the context vector.
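To make the mechanism concrete, below is a minimal sketch of a global Luong attention layer (the “general” scoring variant) in TensorFlow/Keras. The shapes assume the 64 image feature locations used in this project; the class and variable names are my own illustrative choices, not the project's exact implementation.

```python
import tensorflow as tf

class LuongAttention(tf.keras.layers.Layer):
    """Global Luong attention with the 'general' score: score(h_t, h_s) = h_t^T W h_s."""

    def __init__(self, units):
        super().__init__()
        self.W = tf.keras.layers.Dense(units)

    def call(self, decoder_hidden, encoder_outputs):
        # decoder_hidden:  (batch, units)      -- top decoder LSTM state at step t
        # encoder_outputs: (batch, 64, units)  -- one vector per image feature location
        query = tf.expand_dims(decoder_hidden, 1)                 # (batch, 1, units)
        scores = tf.matmul(query, self.W(encoder_outputs),
                           transpose_b=True)                      # (batch, 1, 64)
        weights = tf.nn.softmax(scores, axis=-1)                  # attention weights
        context = tf.matmul(weights, encoder_outputs)             # (batch, 1, units)
        return tf.squeeze(context, 1), tf.squeeze(weights, 1)
```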

Hyperparameter Tuning and Results
Two hyperparameters were explored: the attention mechanism (Bahdanau vs. Luong) and the number of LSTM layers in the decoder (1, 2, or 4).
The Luong Attention model provided better results than Bahdanau Attention, producing an accuracy of 24.5% versus 19.4%.
Varying the decoder depth, one LSTM layer produced an accuracy of 20.4%, two layers produced 24.5%, and four layers produced 22.8%.
Application Development and Hosting
The web application was developed using Flask (a Python package), HTML, and CSS. Since the trained neural network is heavy in both size and computation, the model is hosted locally; the PythonAnywhere website was not able to execute the prediction step. The GitHub repository can be accessed here.
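A minimal sketch of how such a Flask front end can serve the model locally is shown below; the route, the template name, and the predict_caption helper are hypothetical placeholders for the project’s actual inference code.

```python
from flask import Flask, render_template, request

# Placeholder for the project's actual inference routine, which would load the
# trained encoder/decoder and generate a caption for the saved image.
def predict_caption(image_path: str) -> str:
    return "caption prediction goes here"

app = Flask(__name__)

@app.route('/', methods=['GET', 'POST'])
def index():
    caption = None
    if request.method == 'POST':
        # Save the uploaded image so the model can read it from disk.
        image_file = request.files['image']
        image_path = 'static/upload.jpg'
        image_file.save(image_path)
        caption = predict_caption(image_path)
    return render_template('index.html', caption=caption)

if __name__ == '__main__':
    app.run(debug=True)
```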
References
1. Szegedy, Christian, et al. “Rethinking the inception architecture for computer vision.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
2. Kudo, Taku, and John Richardson. “SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing.” arXiv preprint arXiv:1808.06226 (2018).
3. Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. “Sequence to sequence learning with neural networks.” Advances in Neural Information Processing Systems 27 (2014): 3104–3112.
4. https://towardsdatascience.com/sequence-to-sequence-tutorial-4fde3ee798d8
5. Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. “Neural machine translation by jointly learning to align and translate.” arXiv preprint arXiv:1409.0473 (2014).
6. https://towardsdatascience.com/sequence-2-sequence-model-with-attention-mechanism-9e9ca2a613a
7. Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning. “Effective approaches to attention-based neural machine translation.” arXiv preprint arXiv:1508.04025 (2015).