CV, AI, CL | May 28, 2025

RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction

Peking U
arXiv:2505.22613v1 | 5 citations | h-index: 19 | Has Code | EMNLP
Originality: Incremental advance
AI Analysis

The paper addresses the problem of generating high-quality training datasets for multimodal tasks. The advance is incremental because it builds on existing MLLM-based recaptioning methods.

The paper tackles the problem of inaccuracies and incompleteness in image recaptioning by proposing RICO, a framework that refines captions through visual reconstruction, resulting in approximately 10% improvements on CapsBench and CompreCap benchmarks.

Image recaptioning is widely used to generate higher-quality training datasets for various multimodal tasks. Existing recaptioning methods typically rely on powerful multimodal large language models (MLLMs) to enhance textual descriptions, but they often suffer from inaccuracies due to hallucinations and from incompleteness caused by missing fine-grained details. To address these limitations, we propose RICO, a novel framework that refines captions through visual reconstruction. Specifically, we use a text-to-image model to reconstruct a caption into a reference image, then prompt an MLLM to identify discrepancies between the original and reconstructed images and refine the caption accordingly. This process is performed iteratively, progressively yielding more faithful and comprehensive descriptions. To mitigate the additional computational cost of the iterative process, we introduce RICO-Flash, which learns to generate captions like RICO using Direct Preference Optimization (DPO). Extensive experiments demonstrate that our approach significantly improves caption accuracy and completeness, outperforming most baselines by approximately 10% on both CapsBench and CompreCap. Code is released at https://github.com/wangyuchi369/RICO.
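The reconstruct, compare, refine loop described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the linked repository: `rico_refine`, `text_to_image`, `refine_caption`, `num_iters`, and the early-stopping rule are all hypothetical names and choices, and the two models are passed in as plain callables. The `dpo_loss` helper shows the standard per-pair DPO objective. How RICO-Flash actually builds its preference pairs is not stated here, so pairing a RICO-refined caption against a weaker one is only an assumption.

```python
import math
from typing import Any, Callable, List


def rico_refine(
    image: Any,
    initial_caption: str,
    text_to_image: Callable[[str], Any],             # hypothetical: caption -> reconstructed image
    refine_caption: Callable[[Any, Any, str], str],  # hypothetical: (original, reconstruction, caption) -> caption
    num_iters: int = 2,                              # assumption: iteration budget is not given in the text
) -> List[str]:
    """Return the caption after each round; index 0 is the initial caption."""
    captions = [initial_caption]
    for _ in range(num_iters):
        current = captions[-1]
        # Step 1: reconstruct the current caption into a reference image.
        reconstructed = text_to_image(current)
        # Step 2: let the MLLM compare both images and rewrite the caption.
        updated = refine_caption(image, reconstructed, current)
        captions.append(updated)
        # Assumption: stop early once a round no longer changes the caption.
        if updated == current:
            break
    return captions


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one preference pair: -log(sigmoid(margin))."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

In practice `text_to_image` would wrap a text-to-image model and `refine_caption` would wrap an MLLM prompt. Keeping them as injected callables lets the loop be exercised with toy stand-ins before any model is attached.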

Code Implementations: 1 repo
