CL | CV | Jun 9, 2025

Unblocking Fine-Grained Evaluation of Detailed Captions: An Explaining AutoRater and Critic-and-Revise Pipeline

arXiv:2506.07631v1 | 1 citation | h-index: 37
Originality: Incremental advance
AI Analysis

This work addresses the problem of fine-grained evaluation for VLM-generated detailed captions, offering a benchmark and tools to improve image understanding, though it is incremental in building on existing methods for factuality assessment.

The paper tackles the challenge of evaluating factual accuracy in detailed image captions generated by Vision-Language Models (VLMs). It introduces DOCCI-Critique, a benchmark with 1,400 captions and over 10,216 human annotations, and VNLI-Critique, a model for automated factuality classification and critique generation. Reported results include a 0.98 Spearman correlation with human judgments for VLM rankings and a 46% gain in caption factuality.

Large Vision-Language Models (VLMs) now generate highly detailed, paragraph-length image captions, yet evaluating their factual accuracy remains challenging. Current methods often miss fine-grained errors, being designed for shorter texts or lacking datasets with verified inaccuracies. We introduce DOCCI-Critique, a benchmark with 1,400 VLM-generated paragraph captions (100 images, 14 VLMs) featuring over 10,216 sentence-level human annotations of factual correctness and explanatory rationales for errors, all within paragraph context. Building on this, we develop VNLI-Critique, a model for automated sentence-level factuality classification and critique generation. We highlight three key applications: (1) VNLI-Critique demonstrates robust generalization, validated by state-of-the-art performance on the M-HalDetect benchmark and strong results in CHOCOLATE claim verification. (2) The VNLI-Critique-driven AutoRater for DOCCI-Critique provides reliable VLM rankings, showing excellent alignment with human factuality judgments (e.g., 0.98 Spearman). (3) An innovative Critic-and-Revise pipeline, where critiques from VNLI-Critique guide LLM-based corrections, achieves substantial improvements in caption factuality (e.g., a 46% gain on DetailCaps-4870). Our work offers a crucial benchmark alongside practical tools, designed to significantly elevate the standards for fine-grained evaluation and foster the improvement of VLM image understanding. Project page: https://google.github.io/unblocking-detail-caption
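
To make the ranking claim in application (2) concrete, the sketch below shows one way sentence-level factuality labels could be rolled up into a caption score and how agreement between two model rankings could be measured with Spearman correlation. This is only an illustration under stated assumptions: the aggregation rule (fraction of factual sentences), the function names, and all scores are hypothetical and are not taken from the paper.

```python
from math import sqrt
from statistics import mean


def caption_score(sentence_labels):
    """Fraction of sentences judged factual in one caption.

    Assumption: this simple ratio is a stand-in; the paper's actual
    aggregation rule is not given in the abstract.
    """
    return sum(sentence_labels) / len(sentence_labels)


def average_ranks(values):
    """Rank values 1..n in ascending order; ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical per-model factuality scores (NOT results from the paper).
human = {"model_a": 0.91, "model_b": 0.84, "model_c": 0.77, "model_d": 0.69}
auto = {"model_a": 0.89, "model_b": 0.86, "model_c": 0.71, "model_d": 0.73}

models = sorted(human)
rho = spearman([human[m] for m in models], [auto[m] for m in models])
print(f"Spearman correlation between rankings: {rho:.2f}")
```

With these made-up scores the two rankings disagree only on the order of the last two models. A value near 1, such as the 0.98 reported in the abstract, would mean the automated ranking of VLMs almost exactly matches the human one.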

