SD AI CL

Nov 28, 2025

ORCA: Open-ended Response Correctness Assessment for Audio Question Answering

Šimon Sedláček, Sara Barahona, Bolaji Yusuf, Laura Herrera-Alarcón, Santosh Kesiraju, Cecilia Bolaños, Alicia Lozano-Diez, Sathvik Udupa, Fernando López, Allison Ferner, Ramani Duraiswami, Jan Černocký

arXiv:2512.09066v1 4.0

Originality: Incremental advance

AI Analysis

This paper addresses the problem of subjective evaluation in audio question answering, offering researchers and practitioners a more nuanced alternative to traditional metrics that report only mean scores.

The paper tackles the challenge of evaluating open-ended responses from large audio language models (LALMs). It presents ORCA, a framework that models the variability in human judgments with Beta distributions to predict both expected correctness and uncertainty, and it reports a 0.91 Spearman correlation with mean human judgments on two audio QA benchmarks.

Evaluating open-ended responses from large audio language models (LALMs) is challenging because human annotators often genuinely disagree on answer correctness due to multiple valid interpretations, partial correctness, and subjective judgment. Traditional metrics reporting only mean scores fail to capture this uncertainty. We present ORCA (Open-ended Response Correctness Assessment), a framework that models the variability in human judgments using Beta distributions to predict both expected correctness and uncertainty. Our three-stage annotation framework combines human judgment with structured feedback and iterative refinement to simultaneously curate training data and improve benchmark quality. We collected 11,721 annotations across 3,580 question-answer pairs from 15 LALMs on two audio QA benchmarks, achieving inter-annotator agreement of 0.82 (Krippendorff's alpha). ORCA achieves 0.91 Spearman correlation with mean human judgments, matching or outperforming LLM-judge baselines while providing uncertainty estimates and requiring significantly less compute. We release our models, code, and curated dataset.
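To make the Beta-distribution idea concrete, the sketch below shows how a single Beta(α, β) can carry both an expected correctness (its mean) and an uncertainty (its variance), and how such a distribution could be fitted to one item's annotator scores by the method of moments. This is an illustration only, not the authors' implementation: the abstract does not specify how ORCA parameterizes or trains its Beta outputs, and the function names, the [0, 1] score scale, and the clamping constant `eps` are all assumptions.

```python
from statistics import fmean, pvariance


def beta_mean_var(alpha: float, beta: float) -> tuple[float, float]:
    """Mean and variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    var = alpha * beta / (total ** 2 * (total + 1))
    return mean, var


def fit_beta_moments(scores: list[float], eps: float = 1e-6) -> tuple[float, float]:
    """Method-of-moments Beta fit to annotator scores in [0, 1].

    Hypothetical helper: assumes each annotator gives a correctness
    score between 0 and 1 for the same question-answer pair.
    """
    m = fmean(scores)
    v = pvariance(scores)
    # Keep the mean strictly inside (0, 1) and the variance strictly
    # inside (0, m * (1 - m)) so the fitted parameters stay positive
    # and finite, even when all annotators agree.
    m = min(max(m, eps), 1.0 - eps)
    max_var = m * (1.0 - m)
    v = min(max(v, eps * max_var), max_var * (1.0 - eps))
    common = max_var / v - 1.0
    return m * common, (1.0 - m) * common


# Annotators who disagree produce a wide Beta (high variance);
# annotators who nearly agree produce a narrow one (low variance).
split = fit_beta_moments([0.0, 0.5, 1.0, 0.5])
agree = fit_beta_moments([0.9, 1.0, 0.9, 1.0])
print("split:", beta_mean_var(*split))
print("agree:", beta_mean_var(*agree))
```

Under these assumptions, two answers with similar mean scores can still be told apart by how much the annotators disagreed, which is the information a mean-only metric discards.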
