CV, AI, CL, MM | Aug 26, 2024

Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization

arXiv:2408.14547v1 | 8 citations | h-index: 34 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses a training bottleneck in image captioning, offering a more stable and human-aligned approach. The advance is incremental, since it builds on existing CLIP-based methods.

The paper tackles the instability and poor descriptive capability that conventional image captioning training shows when it optimizes modern metrics like CLIP-Score. It proposes DiCO, a new training paradigm that improves stability, caption quality, and alignment with human preferences while maintaining competitive performance on traditional metrics.
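
For readers unfamiliar with the metric being optimized, here is a minimal sketch of how CLIP-Score is commonly defined: a rescaled, clipped cosine similarity between an image embedding and a caption embedding, with the usual rescaling weight w = 2.5. The embeddings below are toy vectors chosen for illustration. In practice they would come from a CLIP image encoder and text encoder, which this sketch does not include.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def clip_score(image_emb, caption_emb, w=2.5):
    """CLIP-Score: w * max(cos(image, caption), 0)."""
    return w * max(cosine(image_emb, caption_emb), 0.0)

# Toy embeddings (assumptions for illustration, not real CLIP outputs).
image = [1.0, 0.0]
aligned_caption = [1.0, 0.0]
unrelated_caption = [0.0, 1.0]

print(clip_score(image, aligned_caption))    # identical direction
print(clip_score(image, unrelated_caption))  # orthogonal direction
```

Because the score depends only on embedding similarity, a captioner trained to maximize it directly can drift toward text that scores well without being fluent, which is the instability the paper targets.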

The conventional training approach for image captioning involves pre-training a network using teacher forcing and subsequent fine-tuning with Self-Critical Sequence Training to maximize hand-crafted captioning metrics. However, when attempting to optimize modern and higher-quality metrics like CLIP-Score and PAC-Score, this training method often encounters instability and fails to acquire the genuine descriptive capabilities needed to produce fluent and informative captions. In this paper, we propose a new training paradigm termed Direct CLIP-Based Optimization (DiCO). Our approach jointly learns and optimizes a reward model that is distilled from a learnable captioning evaluator with high human correlation. This is done by solving a weighted classification problem directly inside the captioner. At the same time, DiCO prevents divergence from the original model, ensuring that fluency is maintained. DiCO not only exhibits improved stability and enhanced quality in the generated captions but also aligns more closely with human preferences compared to existing methods, especially in modern metrics. Additionally, it maintains competitive performance in traditional metrics. Our source code and trained models are publicly available at https://github.com/aimagelab/DiCO.
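
The abstract describes the objective only at a high level: a weighted classification problem solved inside the captioner, with a constraint that keeps the model close to the original. The sketch below shows one way such an objective could look, assuming a DPO-style formulation in which a preferred ("winner") caption is contrasted with several "loser" captions and each pair is weighted by the evaluator's reward gap. It is not the paper's exact loss. The function name, the softmax weighting, and the `beta` and `tau` parameters are all assumptions made for illustration; the authors' repository is the authoritative reference.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def weighted_preference_loss(policy_logp_w, ref_logp_w,
                             policy_logp_l, ref_logp_l,
                             reward_w, rewards_l,
                             beta=0.2, tau=1.0):
    """Hypothetical DPO-style loss for one winner and several loser captions.

    policy_logp_* : caption log-probabilities under the model being trained
    ref_logp_*    : caption log-probabilities under the frozen original model
    reward_*      : scores from a caption evaluator (e.g. a CLIP-based metric)
    beta, tau     : illustrative hyperparameters (assumptions)
    """
    # The log-ratio against the frozen model is what penalizes divergence.
    margin_w = beta * (policy_logp_w - ref_logp_w)
    # Pairs with a larger reward gap receive a larger weight.
    weights = softmax([(reward_w - r) / tau for r in rewards_l])
    loss = 0.0
    for weight, lp, lr in zip(weights, policy_logp_l, ref_logp_l):
        margin_l = beta * (lp - lr)
        # Binary classification term: is the winner preferred over this loser?
        loss -= weight * log_sigmoid(margin_w - margin_l)
    return loss

# Toy values (assumptions): the policy starts identical to the reference.
start = weighted_preference_loss(-10.0, -10.0, [-12.0, -11.0], [-12.0, -11.0],
                                 reward_w=0.8, rewards_l=[0.5, 0.6])
# The policy has raised the winner's likelihood relative to the reference.
later = weighted_preference_loss(-8.0, -10.0, [-12.0, -11.0], [-12.0, -11.0],
                                 reward_w=0.8, rewards_l=[0.5, 0.6])
print(start, later)
```

When the policy equals the reference, every margin is zero and the loss reduces to log 2. It falls as the policy assigns relatively more probability to the winner, and the reference log-ratios keep that shift measured against the original model.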

Code Implementations: 1 repo
