CV · CL · Jan 30, 2022

A Frustratingly Simple Approach for End-to-End Image Captioning

arXiv:2201.12723v3 · 20 citations
Originality: Incremental advance
AI Analysis

This work addresses two problems in image captioning that affect researchers and practitioners: the need for fine-grained object annotations and the propagation of detector errors into the captioning model. The advance is incremental, since it builds on pre-trained models such as CLIP-ViT and GPT2.

The paper proposes VC-GPT, an end-to-end image captioning framework that eliminates the need for an extra object detector and achieves state-of-the-art or second-best performance on benchmarks such as MSCOCO, Flickr30k, and NoCaps.

Image captioning is a fundamental task that joins vision and language, involving both cross-modal understanding and text generation. It has attracted growing attention in recent years. Most existing works follow a traditional two-stage training paradigm: before the captioning model is trained, an extra object detector is first used to recognize the objects in the image. However, training the object detector requires sizeable datasets with fine-grained object annotations, which are daunting to collect. In addition, errors from the object detector easily propagate to the downstream captioning model and degrade its performance. To alleviate these defects, we propose a frustratingly simple but highly effective end-to-end image captioning framework, Visual Conditioned GPT (VC-GPT), which connects a pre-trained visual encoder (CLIP-ViT) and a pre-trained language decoder (GPT2). Unlike the vanilla connection method, which directly inserts cross-attention modules into GPT2, we devise a self-ensemble cross-modal fusion mechanism that comprehensively considers both single- and cross-modal knowledge. As a result, no extra object detector is needed for model training. Experimental results on three popular image captioning benchmarks (MSCOCO, Flickr30k and NoCaps) demonstrate that VC-GPT achieves either the best or the second-best performance across all evaluation metrics against extensive baseline systems.
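The abstract does not spell out how the self-ensemble fusion combines single- and cross-modal knowledge, so the following is only a minimal pure-Python sketch of one plausible reading. It assumes that a text hidden state attends over image patch features through single-head cross-attention, and that the final vocabulary logits are a weighted mix of a text-only prediction and a fused prediction. The function names, the residual connection, and the mixing weight `alpha` are all hypothetical and are not taken from the paper.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention of one text query over image features."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def project(hidden, weight):
    """Linear map from a hidden vector to vocabulary logits (weight is vocab x dim)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]


def self_ensemble_logits(text_hidden, image_feats, w_text, w_fused, alpha=0.5):
    """Mix text-only and image-conditioned logits.

    ASSUMPTION: the ensemble is a convex combination of two sets of logits,
    and the fused state is a residual sum of the text state and its
    cross-attention output. The source only states that both single- and
    cross-modal knowledge are considered.
    """
    attended = cross_attention(text_hidden, image_feats, image_feats)
    fused_hidden = [t + a for t, a in zip(text_hidden, attended)]
    text_logits = project(text_hidden, w_text)      # single-modal branch
    fused_logits = project(fused_hidden, w_fused)   # cross-modal branch
    return [alpha * t + (1 - alpha) * f for t, f in zip(text_logits, fused_logits)]


# Toy usage: a 2-dimensional hidden state, two image patches, a 3-word vocabulary.
text_hidden = [1.0, 0.0]
image_feats = [[1.0, 0.0], [0.0, 1.0]]
vocab_proj = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
logits = self_ensemble_logits(text_hidden, image_feats, vocab_proj, vocab_proj)
```

With `alpha=1.0` the sketch reduces to a text-only language model, and with `alpha=0.0` it reduces to a decoder that relies entirely on the cross-attention branch, which corresponds to what the abstract calls the vanilla connection. Intermediate values let both branches contribute to the next-token prediction.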
