CL, AS | Nov 8, 2024

Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks

CMU, MIT
arXiv:2411.05361v2 | 83 citations | h-index: 56 | Has Code
Originality: Synthesis-oriented
AI Analysis

This benchmark addresses the evaluation of spoken language models, a problem relevant to researchers and developers. Its contribution is incremental: it builds on the previous version of Dynamic-SUPERB by adding new tasks and capabilities.

The authors address the lack of a comprehensive evaluation benchmark for instruction-based universal speech models by introducing Dynamic-SUPERB Phase-2, an open and evolving benchmark that expands to 180 tasks, making it the largest benchmark for speech and audio evaluation. Results show that no model performed well universally: SALMONN-13B excelled in English ASR and Qwen2-Audio-7B-Instruct achieved high accuracy in emotion recognition, but neither was strong across the full range of tasks.

Multimodal foundation models, such as Gemini and ChatGPT, have revolutionized human-machine interactions by seamlessly integrating various forms of data. Developing a universal spoken language model that comprehends a wide range of natural language instructions is critical for bridging communication gaps and facilitating more intuitive interactions. However, the absence of a comprehensive evaluation benchmark poses a significant challenge. We present Dynamic-SUPERB Phase-2, an open and evolving benchmark for the comprehensive evaluation of instruction-based universal speech models. Building upon the first generation, this second version incorporates 125 new tasks contributed collaboratively by the global research community, expanding the benchmark to a total of 180 tasks, making it the largest benchmark for speech and audio evaluation. While the first generation of Dynamic-SUPERB was limited to classification tasks, Dynamic-SUPERB Phase-2 broadens its evaluation capabilities by introducing a wide array of novel and diverse tasks, including regression and sequence generation, across speech, music, and environmental audio. Evaluation results show that no model performed well universally. SALMONN-13B excelled in English ASR and Qwen2-Audio-7B-Instruct showed high accuracy in emotion recognition, but current models still require further innovations to handle a broader range of tasks. We open-source all task data and the evaluation pipeline at https://github.com/dynamic-superb/dynamic-superb.

Code Implementations: 1 repo
