CV · May 29, 2025

VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation

arXiv:2505.23484v1 · 5 citations · h-index: 10 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses inadequate fine-grained caption evaluation, a problem for researchers and developers in text-to-video generation. The advance is incremental because it builds on existing benchmark efforts.

The authors tackled the lack of fine-grained evaluation for video captions in text-to-video generation by introducing VCapsBench, a large-scale benchmark with 5,677 videos and 109,796 question-answer pairs across 21 dimensions, which provides actionable insights for caption optimization.

Video captions play a crucial role in text-to-video generation tasks, as their quality directly influences the semantic coherence and visual fidelity of the generated videos. Although large vision-language models (VLMs) have demonstrated significant potential in caption generation, existing benchmarks inadequately address fine-grained evaluation, particularly in capturing spatial-temporal details critical for video generation. To address this gap, we introduce the Fine-grained Video Caption Evaluation Benchmark (VCapsBench), the first large-scale fine-grained benchmark comprising 5,677 (5K+) videos and 109,796 (100K+) question-answer pairs. These QA pairs are systematically annotated across 21 fine-grained dimensions (e.g., camera movement and shot type) that are empirically proven critical for text-to-video generation. We further introduce three metrics, Accuracy (AR), Inconsistency Rate (IR), and Coverage Rate (CR), and an automated evaluation pipeline that uses a large language model (LLM) to verify caption quality via contrastive QA-pair analysis. By providing actionable insights for caption optimization, our benchmark can advance the development of robust text-to-video models. The dataset and code are available at https://github.com/GXYM/VCapsBench.
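The abstract names the three metrics but does not define them, so the following is only a sketch of how such rates could be computed from per-question judgments. The outcome labels (`correct`, `inconsistent`, `uncovered`), the function name `caption_metrics`, and the formulas themselves are assumptions made for illustration, not definitions taken from the paper or its repository.

```python
from collections import Counter

# Hypothetical per-question outcomes; the paper's actual label set may differ.
CORRECT = "correct"            # caption-based answer agrees with the reference answer
INCONSISTENT = "inconsistent"  # caption-based answer contradicts the reference answer
UNCOVERED = "uncovered"        # caption lacks the detail needed to answer the question

VALID_OUTCOMES = {CORRECT, INCONSISTENT, UNCOVERED}


def caption_metrics(outcomes):
    """Return (accuracy, inconsistency_rate, coverage_rate) for one caption.

    Assumed definitions (not from the paper), each a fraction of all questions:
      accuracy           = correct / total
      inconsistency_rate = inconsistent / total
      coverage_rate      = (correct + inconsistent) / total
    """
    total = len(outcomes)
    if total == 0:
        raise ValueError("need at least one QA outcome")

    counts = Counter(outcomes)
    unknown = set(counts) - VALID_OUTCOMES
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")

    accuracy = counts[CORRECT] / total
    inconsistency_rate = counts[INCONSISTENT] / total
    # A question counts as covered when the caption supports any answer at all.
    coverage_rate = (counts[CORRECT] + counts[INCONSISTENT]) / total
    return accuracy, inconsistency_rate, coverage_rate


# Example: 10 questions about one video caption.
example = [CORRECT] * 6 + [INCONSISTENT] * 2 + [UNCOVERED] * 2
ar, ir, cr = caption_metrics(example)
print(f"AR={ar:.2f} IR={ir:.2f} CR={cr:.2f}")
```

Under these assumed definitions, a caption that omits details lowers coverage without raising inconsistency, which separates missing information from wrong information. The paper's own formulas may normalize differently, for example by computing accuracy over covered questions only.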

Code Implementations: 1 repo