CV · AI · LG · Mar 12, 2025

DAVE: Diagnostic benchmark for Audio Visual Evaluation

arXiv:2503.09321v1 · 4 citations · h-index: 76 · Has Code
Originality: Incremental advance
AI Analysis

DAVE provides a standardized diagnostic framework that lets researchers in audio-visual understanding identify and address specific model weaknesses. The advance is incremental, since it builds on existing benchmark efforts.

The authors address strong visual bias and conflated error sources in audio-visual benchmarks by introducing DAVE, a diagnostic dataset that ensures both modalities are necessary and decouples evaluation into subcategories. Their analysis reveals specific failure modes in state-of-the-art models.

Audio-visual understanding is a rapidly evolving field that seeks to integrate and interpret information from both auditory and visual modalities. Despite recent advances in multi-modal learning, existing benchmarks often suffer from strong visual bias -- where answers can be inferred from visual data alone -- and provide only aggregate scores that conflate multiple sources of error. This makes it difficult to determine whether models struggle with visual understanding, audio interpretation, or audio-visual alignment. In this work, we introduce DAVE (Diagnostic Audio Visual Evaluation), a novel benchmark dataset designed to systematically evaluate audio-visual models across controlled challenges. DAVE alleviates existing limitations by (i) ensuring both modalities are necessary to answer correctly and (ii) decoupling evaluation into atomic subcategories. Our detailed analysis of state-of-the-art models reveals specific failure modes and provides targeted insights for improvement. By offering this standardized diagnostic framework, we aim to facilitate more robust development of audio-visual models. The dataset is released: https://github.com/gorjanradevski/dave
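The difference between an aggregate score and a decoupled, per-subcategory evaluation can be sketched in a few lines of Python. This is a minimal illustration only, not the DAVE evaluation code: the function names, the record layout, and the three category labels (`visual_only`, `audio_only`, `audio_visual_alignment`) are assumptions chosen to mirror the error sources named in the abstract, and the results are made up.

```python
from collections import defaultdict


def aggregate_accuracy(records):
    """Single overall score: fraction of correct answers across all records."""
    records = list(records)
    return sum(int(ok) for _, ok in records) / len(records)


def accuracy_by_category(records):
    """Decoupled score: accuracy computed separately for each category."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for category, ok in records:
        totals[category] += 1
        hits[category] += int(ok)
    return {category: hits[category] / totals[category] for category in totals}


# Hypothetical (category, is_correct) results; the labels are illustrative
# and are not taken from the DAVE dataset.
records = [
    ("visual_only", True),
    ("visual_only", True),
    ("audio_only", True),
    ("audio_only", False),
    ("audio_visual_alignment", False),
    ("audio_visual_alignment", False),
]

print(aggregate_accuracy(records))
print(accuracy_by_category(records))
```

In this toy example the aggregate accuracy is 0.5, which hides the fact that the hypothetical model answers every visual-only item correctly and every alignment item incorrectly. The per-category breakdown makes that failure mode visible.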

Code Implementations: 1 repo
