CV, AI, CL, MM | Nov 24, 2021

Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets

arXiv:2111.12727v3 | 15 citations
Originality: Incremental advance
AI Analysis

This paper addresses the problem of low-quality captions learned from noisy web data in image captioning, offering an incremental improvement by combining data sources more effectively.

The paper tackles fluent image caption generation by training on a mix of human-annotated and web-collected datasets. It proposes a model that separates semantics from style to replicate human-like captions, and it reports consistent gains over baselines and state-of-the-art methods on datasets such as COCO.

This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources, containing both human-annotated and web-collected captions. Large-scale datasets with noisy image-text pairs, indeed, provide a sub-optimal source of supervision because of their low-quality descriptive style, while human-annotated datasets are cleaner but smaller in scale. To get the best of both worlds, we propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component. The proposed model avoids the need of object detectors, is trained with a single objective of prompt language modeling, and can replicate the style of human-collected captions while training on sources with different input styles. Experimentally, the model shows a strong capability of recognizing real-world concepts and producing high-quality captions. Extensive experiments are performed on different image captioning datasets, including CC3M, nocaps, and the competitive COCO dataset, where our model consistently outperforms baselines and state-of-the-art approaches.
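The conditioning idea in the abstract (a style token plus retrieved keywords placed in front of the caption) can be illustrated with a small sketch. This is not the authors' implementation: the token strings (`<style:...>`, `<sep>`, `<eos>`), the function names, the toy embeddings, and the use of cosine similarity for retrieval are all assumptions made for illustration only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_keywords(image_emb, keyword_embs, k=3):
    """Hypothetical retrieval step: return the k keywords closest to the image embedding."""
    ranked = sorted(
        keyword_embs,
        key=lambda word: cosine(image_emb, keyword_embs[word]),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(style, keywords, caption_tokens=None):
    """Assumed prompt layout: style token, keywords, separator, then the caption if given."""
    prompt = [f"<style:{style}>"] + list(keywords) + ["<sep>"]
    if caption_tokens is not None:
        # Training-time sequence: the caption follows the prompt.
        prompt += list(caption_tokens) + ["<eos>"]
    return prompt

# Toy embeddings, invented for this example.
keyword_embs = {
    "dog": [0.9, 0.1, 0.0],
    "frisbee": [0.7, 0.3, 0.1],
    "car": [0.0, 0.2, 0.9],
    "grass": [0.6, 0.5, 0.0],
}
image_emb = [1.0, 0.2, 0.0]

keywords = retrieve_keywords(image_emb, keyword_embs, k=2)

# Inference: request the human-annotated style even though training mixed sources.
inference_prompt = build_prompt("human", keywords)

# Training on a web-collected pair: same layout, different style token.
training_sequence = build_prompt("web", keywords, ["a", "dog", "playing", "outside"])
```

Under these assumptions, switching the style token at inference time is what lets a single model trained on mixed sources be asked for captions in the human-annotated style, while the keywords carry the semantic content.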