AuditoryBench++: Can Language Models Understand Auditory Knowledge without Hearing?
This work addresses a gap in multimodal interaction for AI systems, though its contribution is incremental because it builds on existing benchmarks and methods.
The paper tackles language models' lack of auditory knowledge by introducing AuditoryBench++, a benchmark for evaluating auditory reasoning in text-only settings, and AIR-CoT, an auditory imagination reasoning method. Experiments show that AIR-CoT generally outperforms both off-the-shelf models and models augmented with auditory knowledge.
Even without directly hearing sounds, humans can effortlessly reason about auditory properties, such as pitch, loudness, or sound-source associations, drawing on auditory commonsense. In contrast, language models often lack this capability, limiting their effectiveness in multimodal interactions. As an initial step to address this gap, we present AuditoryBench++, a comprehensive benchmark for evaluating auditory knowledge and reasoning in text-only settings. The benchmark encompasses tasks that range from basic auditory comparisons to contextually grounded reasoning, enabling fine-grained analysis of how models process and integrate auditory concepts. In addition, we introduce AIR-CoT, a novel auditory imagination reasoning method that generates and integrates auditory information during inference through span detection with special tokens and knowledge injection. Extensive experiments with recent LLMs and Multimodal LLMs demonstrate that AIR-CoT generally outperforms both the off-the-shelf models and those augmented with auditory knowledge. The project page is available at https://auditorybenchpp.github.io.
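The abstract describes AIR-CoT only at a high level: spans are detected with special tokens, and auditory knowledge is injected at those points. As a rough illustration of that general idea, and not the authors' implementation, the sketch below marks spans with hypothetical `[imagine]` and `[/imagine]` tokens and fills each one from a toy text lookup. The token names, the function names, and the lookup table are all assumptions made for this example; the paper's actual tokens and its injection mechanism may differ substantially.

```python
import re
from typing import Callable

# Hypothetical special tokens; the paper's actual token names are not given here.
OPEN, CLOSE = "[imagine]", "[/imagine]"
SPAN_RE = re.compile(re.escape(OPEN) + r"(.*?)" + re.escape(CLOSE), re.DOTALL)


def detect_spans(text: str) -> list[str]:
    """Return the text of every span wrapped in the special tokens."""
    return [m.group(1).strip() for m in SPAN_RE.finditer(text)]


def inject_knowledge(text: str, lookup: Callable[[str], str]) -> str:
    """Replace each marked span with the span followed by a looked-up note."""
    def _sub(m: re.Match) -> str:
        span = m.group(1).strip()
        return f"{span} ({lookup(span)})"
    return SPAN_RE.sub(_sub, text)


# Toy stand-in for an auditory knowledge source (an assumption for illustration).
TOY_KNOWLEDGE = {"a whistle": "high-pitched", "a bass drum": "low-pitched"}


def toy_lookup(span: str) -> str:
    """Look up a span in the toy table, with a fallback for unknown spans."""
    return TOY_KNOWLEDGE.get(span, "no auditory note available")


draft = "Compare [imagine]a whistle[/imagine] with [imagine]a bass drum[/imagine]."
print(detect_spans(draft))
print(inject_knowledge(draft, toy_lookup))
```

The sketch keeps detection and injection as separate functions so that the lookup can be swapped for any other knowledge source without touching the span-matching logic.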