CV | May 9, 2021

A Hybrid Model for Combining Neural Image Caption and k-Nearest Neighbor Approach for Image Captioning

arXiv:2105.03826v1 | 1 citation
Originality: Synthesis-oriented
AI Analysis

This work is an incremental improvement in image captioning, with potential benefit to applications in computer vision and natural language processing.

The paper tackles image captioning by proposing a hybrid model that combines the Neural Image Caption (NIC) and k-nearest neighbor approaches. The hybrid achieves a BLEU-4 score of 18.20 on the Flickr8k dataset, higher than either individual model (16.01 for NIC and 15.95 for k-nearest neighbor).

A hybrid model is proposed that integrates two popular image captioning methods to generate a text-based summary describing the contents of an image. The two image captioning models, the Neural Image Caption (NIC) and the k-nearest neighbor approach, are trained individually on the training set. From the validation set, we extract a set of five features for evaluating the results of the two models; these features are in turn used to train a logistic regression classifier. The binary ground truth for the classifier is generated by comparing the BLEU-4 scores of the two models. For the test set, the input images are first passed separately through the two models to generate the individual captions. The five-dimensional feature set extracted from the two models is then passed to the logistic regression classifier, which decides which of the two generated captions becomes the final caption. Our implementation of the k-nearest neighbor model achieves a BLEU-4 score of 15.95 and the NIC model achieves a BLEU-4 score of 16.01 on the benchmark Flickr8k dataset. The proposed hybrid model achieves a BLEU-4 score of 18.20, supporting the validity of our approach.
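
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the five features, so the feature vectors below are placeholder numbers, and the tie rule (NIC wins when the BLEU-4 scores are equal), the toy data, and the small pure-Python logistic regression are all assumptions made only to keep the example self-contained.

```python
import math

def make_labels(bleu_nic, bleu_knn):
    """Binary ground truth: 1 if the NIC caption's BLEU-4 is at least the
    k-NN caption's, else 0. The tie rule (>=) is an assumption."""
    return [1 if a >= b else 0 for a, b in zip(bleu_nic, bleu_knn)]

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logreg(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on the logistic loss; returns (weights, bias)."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def choose_caption(features, nic_caption, knn_caption, w, b):
    """Return the caption the classifier predicts to be the better one."""
    p_nic = sigmoid(sum(wj * xj for wj, xj in zip(w, features)) + b)
    return nic_caption if p_nic >= 0.5 else knn_caption

# Toy "validation set": five placeholder features per image (hypothetical
# values, not the paper's features) and per-image BLEU-4 for each model.
val_features = [
    [0.9, 0.5, 0.5, 0.5, 0.5],
    [0.8, 0.5, 0.5, 0.5, 0.5],
    [0.1, 0.5, 0.5, 0.5, 0.5],
    [0.2, 0.5, 0.5, 0.5, 0.5],
]
val_bleu_nic = [0.30, 0.25, 0.05, 0.10]
val_bleu_knn = [0.10, 0.20, 0.20, 0.30]

labels = make_labels(val_bleu_nic, val_bleu_knn)
w, b = train_logreg(val_features, labels)

# "Test time": both models caption the image, the classifier picks one.
final = choose_caption(
    [0.85, 0.5, 0.5, 0.5, 0.5],
    nic_caption="a dog runs across a field",
    knn_caption="a brown dog in the grass",
    w=w, b=b,
)
print(final)
```

In practice a library classifier such as scikit-learn's `LogisticRegression` would likely replace the hand-written trainer; the pure-Python version is used here only so the sketch runs without dependencies.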

Code Implementations: 1 repo
