CL | Apr 23

AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA

arXiv:2604.21766
AI Analysis

For researchers in audio understanding and AI, this benchmark exposes the limitations of current models in robust auditory reasoning, providing a more challenging evaluation than existing datasets.

The paper introduces AUDITA, a large-scale audio QA benchmark designed to require genuine reasoning beyond surface-level acoustic cues. Humans achieve 32.13% accuracy, while state-of-the-art models score below 8.86%, highlighting a significant gap.

Existing audio question answering benchmarks largely emphasize sound event classification or caption-grounded queries, often enabling models to succeed through shortcut strategies, short-duration cues, lexical priors, dataset-specific biases, or even by bypassing the audio via metadata and captions rather than through genuine reasoning. Thus, we present AUDITA (Audio Understanding from Diverse Internet Trivia Authors), a large-scale, real-world benchmark to rigorously evaluate audio reasoning beyond surface-level acoustic recognition. AUDITA comprises carefully curated, human-authored trivia questions grounded in real-world audio, designed to stress robust auditory reasoning through challenging distractors and long-range temporal dependencies, using probing queries that cannot be answered from isolated text or sound cues alone. The average human accuracy of 32.13% shows both that the task is challenging and that humans meaningfully comprehend the audio. In stark contrast, state-of-the-art audio question answering models perform poorly, with average accuracy below 8.86%. Beyond raw accuracy, we apply Item Response Theory (IRT) to estimate latent proficiency and question difficulty, and to expose systematic deficiencies of the models and data.
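
To make the IRT step concrete, here is a minimal sketch of how latent proficiency and question difficulty interact under a two-parameter logistic (2PL) model. The abstract does not say which IRT variant the authors use, so the 2PL form, the function names, the grid-search estimator, and the toy item parameters are all assumptions chosen for illustration, not the paper's method or data.

```python
import math

def p_correct(theta, difficulty, discrimination=1.0):
    """2PL IRT: probability that a respondent with ability `theta`
    answers an item with the given difficulty correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

def log_likelihood(theta, items, responses):
    """Log-likelihood of a 0/1 response pattern for one respondent.
    `items` is a list of (difficulty, discrimination) pairs."""
    total = 0.0
    for (difficulty, discrimination), correct in zip(items, responses):
        p = p_correct(theta, difficulty, discrimination)
        total += math.log(p) if correct else math.log(1.0 - p)
    return total

def estimate_theta(items, responses, lo=-4.0, hi=4.0, steps=801):
    """Maximum-likelihood ability estimate via a simple grid search
    (an illustrative stand-in for a proper optimizer)."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))

# Hypothetical items: easy, medium, hard (difficulty, discrimination).
items = [(-1.0, 1.0), (0.0, 1.0), (1.5, 1.0)]
strong = estimate_theta(items, [1, 1, 0])  # two of three correct
weak = estimate_theta(items, [1, 0, 0])    # one of three correct
print(f"strong respondent: {strong:.2f}, weak respondent: {weak:.2f}")
```

In this sketch the item parameters are fixed and only the respondent's ability is estimated. A full IRT analysis would fit abilities and item parameters jointly from the whole response matrix of humans and models.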
