CV · SD · Dec 10, 2025

VABench: A Comprehensive Benchmark for Audio-Video Generation

arXiv:2512.09299v1 · 11 citations · h-index: 6
Originality: Synthesis-oriented
AI Analysis

This paper addresses the inadequate benchmarking of audio-video generation models. The contribution is incremental: it builds on existing video generation benchmarks by adding audio-specific evaluations.

The authors tackle the lack of convincing evaluations for audio-video generation by introducing VABench, a benchmark framework that systematically assesses synchronous audio-video generation across multiple tasks and dimensions, with the aim of establishing a new standard for the field.

Recent advances in video generation have been remarkable, enabling models to produce visually compelling videos with synchronized audio. While existing video generation benchmarks provide comprehensive metrics for visual quality, they lack convincing evaluations for audio-video generation, especially for models aiming to generate synchronized audio-video outputs. To address this gap, we introduce VABench, a comprehensive and multi-dimensional benchmark framework designed to systematically evaluate the capabilities of synchronous audio-video generation. VABench encompasses three primary task types: text-to-audio-video (T2AV), image-to-audio-video (I2AV), and stereo audio-video generation. It further establishes two major evaluation modules covering 15 dimensions. These dimensions specifically assess pairwise similarities (text-video, text-audio, video-audio), audio-video synchronization, lip-speech consistency, and carefully curated audio and video question-answering (QA) pairs, among others. Furthermore, VABench covers seven major content categories: animals, human sounds, music, environmental sounds, synchronous physical sounds, complex scenes, and virtual worlds. We provide a systematic analysis and visualization of the evaluation results, aiming to establish a new standard for assessing video generation models with synchronous audio capabilities and to promote the comprehensive advancement of the field.

