CV · Mar 10, 2025

Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning

arXiv:2503.07906v123 citations · h-index: 4 · ICLR
Originality: Highly original
AI Analysis

This work addresses the underexplored evaluation of detailed image captioning for vision-language models, offering a new benchmark and a metric that measures both hallucination and fine-grained comprehensiveness.

The authors tackle the evaluation of detailed image captioning by introducing the DeCapBench benchmark and a new metric, DCScore. They also present FeedQuill, an automatic feedback collection method for preference optimization built on that metric, which reduces hallucinations and surpasses GPT-4o on detailed captioning.

Image captioning has long been a pivotal task in visual understanding, with recent advancements in vision-language models (VLMs) significantly enhancing the ability to generate detailed image captions. However, the evaluation of detailed image captioning remains underexplored due to outdated evaluation metrics and coarse annotations. In this paper, we introduce DeCapBench along with a novel metric, DCScore, specifically designed for detailed captioning tasks. DCScore evaluates hallucinations and fine-grained comprehensiveness by deconstructing responses into the smallest self-sufficient units, termed primitive information units, and assessing them individually. Our evaluation shows that DCScore aligns more closely with human judgment than other rule-based or model-based metrics. Concurrently, DeCapBench exhibits a high correlation with VLM arena results on descriptive tasks, surpassing existing benchmarks for vision-language models. Additionally, we present an automatic fine-grained feedback collection method, FeedQuill, for preference optimization based on our advanced metric, showing robust generalization capabilities across auto-generated preference data. Extensive experiments on multiple VLMs demonstrate that our method not only significantly reduces hallucinations but also enhances performance across various benchmarks, achieving superior detail captioning performance while surpassing GPT-4o.
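
To make the unit-level idea in the abstract concrete, here is a minimal sketch of scoring a caption from per-unit judgments. It assumes the caption has already been split into primitive information units and that each unit has already been judged against the image; the `Unit` type, the `unit_level_score` function, the exact-string matching against reference units, and the precision/recall/F1 aggregation are all illustrative assumptions, not the paper's actual DCScore definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One primitive information unit from a caption (hypothetical type)."""
    text: str
    supported: bool  # True if a judge found the unit consistent with the image


def unit_level_score(predicted: list[Unit], reference: set[str]) -> dict[str, float]:
    """Aggregate per-unit judgments into caption-level scores.

    precision: share of predicted units that are supported, so
               hallucinated units lower it.
    recall:    share of reference units covered by supported predicted
               units, a proxy for comprehensiveness.
    Assumption: units are matched to the reference by exact string
    equality; a real system would need a semantic matcher.
    """
    supported = [u for u in predicted if u.supported]
    precision = len(supported) / len(predicted) if predicted else 0.0
    covered = {u.text for u in supported} & reference
    recall = len(covered) / len(reference) if reference else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example: four predicted units, one of them hallucinated.
predicted = [
    Unit("a dog is on the grass", True),
    Unit("the dog is brown", True),
    Unit("a red ball is nearby", False),  # judged as hallucinated
    Unit("the sky is cloudy", True),
]
reference = {
    "a dog is on the grass",
    "the dog is brown",
    "a fence is in the background",  # missed by the caption
    "the sky is cloudy",
}
scores = unit_level_score(predicted, reference)
```

In this toy case the hallucinated unit lowers precision and the missed reference unit lowers recall, so the two kinds of error show up separately before being combined into one number.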

