CV · Apr 29, 2020

Image Captioning through Image Transformer

arXiv:2004.14231v2 · 118 citations
AI Analysis

This work addresses the challenge of generating accurate image captions by improving attention mechanisms for computer vision and natural language processing tasks, representing an incremental advancement over existing transformer-based methods.

The authors tackled the problem of adapting the transformer architecture for image captioning by introducing an image transformer that modifies the encoding transformer and uses an implicit decoding transformer to account for spatial relationships between image regions, achieving new state-of-the-art performance on MSCOCO benchmarks.

Automatic captioning of images is a task that combines the challenges of image analysis and text generation. One important aspect of captioning is the notion of attention: how to decide what to describe and in which order. Inspired by the successes in text analysis and translation, previous work has proposed the transformer architecture for image captioning. However, the structure between the semantic units in images (usually the regions detected by an object detection model) and in sentences (each single word) is different. Limited work has been done to adapt the transformer's internal architecture to images. In this work, we introduce the image transformer, which consists of a modified encoding transformer and an implicit decoding transformer, motivated by the relative spatial relationships between image regions. Our design widens the original transformer layer's inner architecture to adapt it to the structure of images. With only region features as inputs, our model achieves new state-of-the-art performance on both the MSCOCO offline and online testing benchmarks.
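The abstract does not spell out how spatial relationships between regions enter the encoder, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It labels each pair of detected boxes by containment and then restricts self-attention to pairs sharing a chosen label. The label names (`parent`, `child`, `neighbor`), the 0.9 containment threshold, and all function names are assumptions introduced here for illustration.

```python
import numpy as np


def containment_ratio(a, b):
    """Fraction of box b's area that lies inside box a; boxes are (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / area_b if area_b > 0 else 0.0


def spatial_relation(a, b, threshold=0.9):
    """Label b relative to a (hypothetical scheme; the threshold is an assumed value)."""
    if containment_ratio(a, b) >= threshold:
        return "child"      # b lies mostly inside a
    if containment_ratio(b, a) >= threshold:
        return "parent"     # a lies mostly inside b
    return "neighbor"       # neither box contains the other


def relation_mask(boxes, relation):
    """Boolean n x n mask: entry (i, j) is True when box j has `relation` to box i.

    The diagonal is always True so every region can attend to itself.
    """
    n = len(boxes)
    mask = np.eye(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            if i != j and spatial_relation(boxes[i], boxes[j]) == relation:
                mask[i, j] = True
    return mask


def masked_self_attention(features, mask):
    """Scaled dot-product self-attention restricted to pairs where mask is True."""
    d = features.shape[-1]
    scores = features @ features.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)        # block disallowed pairs
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ features


# Toy usage: three boxes, one nested in another, one far away.
boxes = [(0, 0, 10, 10), (2, 2, 4, 4), (20, 20, 30, 30)]
features = np.random.default_rng(0).normal(size=(3, 4))
child_out = masked_self_attention(features, relation_mask(boxes, "child"))
```

Running one such masked attention per relation type and combining the outputs would give a layer that is wider than a standard transformer layer, which is one way to read the abstract's description; the combination step is left out here because the abstract does not describe it.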

Code Implementations: 2 repos
