CV, LG · Oct 11, 2023

A Comparative Study of Pre-trained CNNs and GRU-Based Attention for Image Caption Generation

arXiv:2310.07252v1 · 4 citations · h-index: 11
Originality: Synthesis-oriented
AI Analysis

This work addresses the generation of textual descriptions from images, a task that spans computer vision and NLP. The contribution is incremental: it builds on existing attention mechanisms and pre-trained models.

The paper tackles image caption generation with a deep neural framework that pairs pre-trained CNN encoders with a GRU-based attention decoder, and it reports competitive scores on the MSCOCO and Flickr30k datasets relative to state-of-the-art methods.

Image captioning is a challenging task involving generating a textual description for an image using computer vision and natural language processing techniques. This paper proposes a deep neural framework for image caption generation using a GRU-based attention mechanism. Our approach employs multiple pre-trained convolutional neural networks as the encoder to extract features from the image and a GRU-based language model as the decoder to generate descriptive sentences. To improve performance, we integrate the Bahdanau attention model with the GRU decoder to enable learning to focus on specific image parts. We evaluate our approach using the MSCOCO and Flickr30k datasets and show that it achieves competitive scores compared to state-of-the-art methods. Our proposed framework can bridge the gap between computer vision and natural language and can be extended to specific domains.
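The attention step described in the abstract can be illustrated with a minimal sketch of Bahdanau-style (additive) attention in plain Python. This is not the authors' implementation: the function and variable names (`additive_attention`, `W_feat`, `W_hid`, `v`), the toy dimensions, and the random weights are all assumptions made for illustration. In a real system the region features would come from a pre-trained CNN and the hidden state from the GRU decoder.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(features, hidden, W_feat, W_hid, v):
    """Additive attention over image regions (hypothetical helper).

    features: list of region vectors from the encoder, each of length D
    hidden:   current decoder hidden state, length H
    W_feat:   A x D projection, W_hid: A x H projection, v: length A
    Returns (context, weights), where weights sum to 1 over regions.
    """
    proj_hidden = matvec(W_hid, hidden)
    scores = []
    for region in features:
        proj_feat = matvec(W_feat, region)
        # score = v^T tanh(W_feat * region + W_hid * hidden)
        energy = [math.tanh(a + b) for a, b in zip(proj_feat, proj_hidden)]
        scores.append(sum(vi * ei for vi, ei in zip(v, energy)))
    weights = softmax(scores)
    dim = len(features[0])
    # Context vector: attention-weighted sum of the region features.
    context = [sum(w * f[d] for w, f in zip(weights, features))
               for d in range(dim)]
    return context, weights


def random_matrix(rows, cols, rng):
    """Small random weights, standing in for learned parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
            for _ in range(rows)]


# Toy sizes (assumed): 4 regions, 6-dim features, 5-dim hidden, 3-dim attention.
rng = random.Random(0)
num_regions, feat_dim, hid_dim, att_dim = 4, 6, 5, 3
features = random_matrix(num_regions, feat_dim, rng)
hidden = [rng.uniform(-1.0, 1.0) for _ in range(hid_dim)]
W_feat = random_matrix(att_dim, feat_dim, rng)
W_hid = random_matrix(att_dim, hid_dim, rng)
v = [rng.uniform(-0.1, 0.1) for _ in range(att_dim)]

context, weights = additive_attention(features, hidden, W_feat, W_hid, v)
```

At each decoding step the context vector would typically be combined with the previous word embedding and fed to the GRU, so the decoder can attend to different image regions as it emits each word. How exactly the paper combines these inputs is not stated in the abstract.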

