SD · AI · Mar 10

SCENEBench: An Audio Understanding Benchmark Grounded in Assistive and Industrial Use Cases

arXiv:2603.09853v1 · 8.5 · h-index: 2 · Has Code

Predicted impact: top 41% in SD · last 90 days · Originality: Incremental advance

AI Analysis

This work addresses the need for better audio comprehension benchmarks in accessibility technology and industrial monitoring, though the advance is incremental because it builds on existing Large Audio Language Model (LALM) capabilities.

The paper tackles the lack of benchmarks for audio understanding beyond speech recognition by proposing SCENEBench, a suite targeting four real-world categories: background sound understanding, noise localization, cross-linguistic speech understanding, and vocal characterizer recognition. It finds that state-of-the-art models show critical performance gaps: on some tasks they score below random chance, while on others they achieve high accuracy.

Advances in large language models (LLMs) have enabled significant capabilities in audio processing, resulting in state-of-the-art models now known as Large Audio Language Models (LALMs). However, little work has been done to measure audio understanding beyond automatic speech recognition (ASR). This paper closes that gap by proposing a benchmark suite, SCENEBench (Spatial, Cross-lingual, Environmental, Non-speech Evaluation), that targets a broad form of audio comprehension across four real-world categories: background sound understanding, noise localization, cross-linguistic speech understanding, and vocal characterizer recognition. These four categories are selected based on understudied needs from accessibility technology and industrial noise monitoring. In addition to performance, we also measure model latency. The purpose of this benchmark suite is to assess audio beyond just what words are said: it also covers how they are said and the non-speech components of the audio. Because our audio samples are synthetically constructed (e.g., by overlaying two natural audio samples), we further validate our benchmark against 20 natural audio items per task, sub-sampled from existing datasets to match our task criteria, to assess ecological validity. We assess five state-of-the-art LALMs and find critical gaps: performance varies across tasks, with models scoring below random chance on some tasks and achieving high accuracy on others. These results provide direction for targeted improvements in model capabilities.
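The abstract mentions that samples are built synthetically, for example by overlaying two natural audio samples. A minimal sketch of one way such an overlay could be done is shown below. It is an illustration under stated assumptions, not the authors' pipeline: the function names `mean_power` and `overlay_at_snr`, the use of a target signal-to-noise ratio (`snr_db`), and the choice to loop the background to the foreground's length are all hypothetical. Audio is represented as plain lists of float samples at a shared sample rate.

```python
import math


def mean_power(samples):
    """Average squared amplitude of a list of float samples."""
    return sum(x * x for x in samples) / len(samples)


def overlay_at_snr(foreground, background, snr_db):
    """Mix `background` under `foreground` at a target SNR in dB.

    Assumption (not from the paper): the mix level is controlled by a
    foreground-to-background power ratio. Returns the mixed samples and
    the gain that was applied to the background.
    """
    if not foreground or not background:
        raise ValueError("both clips must contain samples")

    # Loop the background so it covers the whole foreground clip.
    n = len(foreground)
    bg = [background[i % len(background)] for i in range(n)]

    fg_power = mean_power(foreground)
    bg_power = mean_power(bg)
    if bg_power == 0:
        raise ValueError("background is silent; SNR is undefined")

    # Choose gain so that fg_power / (gain^2 * bg_power) = 10^(snr_db / 10).
    gain = math.sqrt(fg_power / (bg_power * 10 ** (snr_db / 10)))
    mixed = [f + gain * b for f, b in zip(foreground, bg)]
    return mixed, gain


if __name__ == "__main__":
    # Hypothetical inputs: a 440 Hz tone as "speech" and a square-like hum.
    sample_rate = 16000
    tone = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(1600)]
    hum = [0.5, -0.5] * 50
    mixed, gain = overlay_at_snr(tone, hum, snr_db=10)
    print(len(mixed), round(gain, 4))
```

A real pipeline would also need to resample both clips to a common rate and guard against clipping after the mix; both steps are omitted here for brevity.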
