CV, CL · Nov 13, 2022

Large-Scale Bidirectional Training for Zero-Shot Image Captioning

arXiv:2211.06774v3 · 5 citations · h-index: 12 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the need for more robust and generalizable image captioning models that can handle diverse domains without extensive fine-tuning. It appears incremental, however, because it builds on existing pretraining strategies.

The paper tackles the difficulty of generating accurate, detailed image captions by introducing BITTERS, a framework that performs zero-shot image captioning through large-scale bidirectional training between image and text and achieves improved performance without task-specific fine-tuning.

When trained on large-scale datasets, image captioning models can understand the content of images from a general domain but often fail to generate accurate, detailed captions. To improve performance, pretraining-and-finetuning has been a key strategy for image captioning. However, we find that large-scale bidirectional training between image and text enables zero-shot image captioning. In this paper, we introduce Bidirectional Image Text Training in largER Scale, BITTERS, an efficient training and inference framework for zero-shot image captioning. We also propose a new evaluation benchmark, which comprises high-quality datasets and an extensive set of metrics, to properly evaluate zero-shot captioning accuracy and societal bias. We additionally provide an efficient finetuning approach for keyword extraction. We show that careful selection of a large-scale training set and model architecture is the key to achieving zero-shot image captioning.

Code Implementations: 1 repo