CV · Nov 4, 2022

OSIC: A New One-Stage Image Captioner Coined

arXiv:2211.02321v1 · 6 citations · h-index: 30
Originality: Incremental advance
AI Analysis

This paper addresses suboptimal feature representation in image captioning, a problem relevant to AI and computer vision researchers. The advance is incremental because it builds on existing transformer-based methods.

The paper tackles the performance gap in image captioning caused by two-stage models by proposing a one-stage captioner (OSIC) that directly transforms images into text, achieving superior results on the MS-COCO benchmark.

Mainstream image captioning models are usually two-stage captioners: they compute object features with a pre-trained detector and feed them into a language model to generate text descriptions. However, this design introduces a task-based information gap that degrades performance, because object features learned for the detection task are a suboptimal representation and cannot provide all the information needed for subsequent text generation. In addition, object features are usually taken from the last layer, which loses the local details of the input image. In this paper, we propose a novel One-Stage Image Captioner (OSIC) with dynamic multi-sight learning, which directly transforms an input image into descriptive sentences in one stage. As a result, the task-based information gap can be greatly reduced. To obtain rich features, we use the Swin Transformer to compute multi-level features and then feed them into a novel dynamic multi-sight embedding module to exploit both the global structure and the local texture of input images. To enhance the encoder's global modeling for captioning, we propose a new dual-dimensional refining module that non-locally models the interactions among the embedded features. With these components, OSIC obtains rich and useful information for the image captioning task. Extensive comparisons on the MS-COCO benchmark dataset verify the superior performance of our method.
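The idea of combining multi-level features with input-dependent weights can be illustrated with a small, dependency-free sketch. This is not the paper's dynamic multi-sight embedding module, whose design the abstract does not specify. It is a generic softmax-gated fusion, and the names `global_average`, `fuse_levels`, and `gate` are hypothetical, chosen only for illustration.

```python
import math


def global_average(feature_map):
    """Average a list of token vectors into a single summary vector."""
    dim = len(feature_map[0])
    count = len(feature_map)
    return [sum(token[d] for token in feature_map) / count for d in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_levels(levels, gate):
    """Fuse several feature levels with input-dependent weights.

    levels: list of feature maps, each a list of token vectors of equal width.
    gate:   hypothetical scoring vector (an assumption, standing in for
            learned parameters); it scores each level's pooled summary.
    """
    pooled = [global_average(feature_map) for feature_map in levels]
    scores = [sum(g * x for g, x in zip(gate, summary)) for summary in pooled]
    weights = softmax(scores)  # weights depend on the input, hence "dynamic"
    dim = len(gate)
    fused = [sum(w * summary[d] for w, summary in zip(weights, pooled))
             for d in range(dim)]
    return weights, fused


# A shallow level keeps many local tokens; a deep level keeps few global ones.
shallow = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
deep = [[2.0, 2.0]]
weights, fused = fuse_levels([shallow, deep], gate=[0.5, 0.5])
```

In a real encoder the levels would be the stage outputs of a hierarchical backbone such as the Swin Transformer, and the gate would be learned jointly with the caption decoder. Here both are toy values.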

