AS | CL | SP | Apr 15, 2024

A Large-Scale Evaluation of Speech Foundation Models

Shu-wen Yang, Heng-Jui Chang, Zili Huang, Andy T. Liu, Cheng-I Lai, Haibin Wu, Jiatong Shi, Xuankai Chang, Hsiang-Sheng Tsai, Wen-Chin Huang, Tzu-hsun Feng, Po-Han Chi

Meta AI, MIT

arXiv:2404.09385v2 | 18.869 citations | h-index: 32 | Has Code | IEEE/ACM Transactions on Audio, Speech, and Language Processing

Originality: Incremental advance

AI Analysis

This work addresses the need for a standardized benchmark in the speech processing community, enabling reproducible and collaborative evaluation of foundation models.

The authors address the lack of a systematic evaluation framework for speech foundation models by establishing the SUPERB benchmark. Their results indicate that the foundation model paradigm is promising for speech: the best-performing model generalizes competitively across most SUPERB tasks.

The foundation model paradigm leverages a shared foundation model to achieve state-of-the-art (SOTA) performance for various tasks, requiring minimal downstream-specific modeling and data annotation. This approach has proven crucial in the field of Natural Language Processing (NLP). However, the speech processing community lacks a similar setup to explore the paradigm systematically. In this work, we establish the Speech processing Universal PERformance Benchmark (SUPERB) to study the effectiveness of the paradigm for speech. We propose a unified multi-tasking framework to address speech processing tasks in SUPERB using a frozen foundation model followed by task-specialized, lightweight prediction heads. Combining our results with community submissions, we verify that the foundation model paradigm is promising for speech, and our multi-tasking framework is simple yet effective, as the best-performing foundation model shows competitive generalizability across most SUPERB tasks. For reproducibility and extensibility, we have developed a long-term maintained platform that enables deterministic benchmarking, allows for result sharing via an online leaderboard, and promotes collaboration through a community-driven benchmark database to support new development cycles. Finally, we conduct a series of analyses to offer an in-depth understanding of SUPERB and speech foundation models, including information flows across tasks inside the models, the correctness of the weighted-sum benchmarking protocol and the statistical significance and robustness of the benchmark.
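The setup described in the abstract (a frozen foundation model, a weighted sum over its layer outputs, and a lightweight task-specific head) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`softmax`, `weighted_sum`, `mean_pool`, `linear_head`), the softmax normalization of the layer weights, and the mean-pooling linear head are assumptions chosen for illustration, and the toy hidden states stand in for the outputs of a real frozen model.

```python
import math
from typing import List, Sequence

Frame = List[float]        # one feature vector of size [dim]
LayerOutput = List[Frame]  # one layer's features, shaped [time][dim]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a flat sequence of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(layers: Sequence[LayerOutput],
                 layer_logits: Sequence[float]) -> LayerOutput:
    """Fuse per-layer features into one sequence.

    Assumption: one trainable logit per layer, normalized with softmax.
    The foundation model itself stays frozen; only these logits and the
    downstream head would be trained.
    """
    if len(layers) != len(layer_logits):
        raise ValueError("need exactly one logit per layer")
    weights = softmax(layer_logits)
    n_frames, dim = len(layers[0]), len(layers[0][0])
    return [
        [sum(w * layer[t][d] for w, layer in zip(weights, layers))
         for d in range(dim)]
        for t in range(n_frames)
    ]


def mean_pool(frames: LayerOutput) -> Frame:
    """Average over time to get one utterance-level vector."""
    n = len(frames)
    return [sum(f[d] for f in frames) / n for d in range(len(frames[0]))]


def linear_head(x: Frame,
                weight: Sequence[Sequence[float]],
                bias: Sequence[float]) -> List[float]:
    """Hypothetical lightweight prediction head: a single linear layer."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
            for row, b in zip(weight, bias)]


# Toy stand-in for frozen hidden states: 3 layers, 2 frames, 2 dims.
toy_layers = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[1.0, 1.0], [1.0, 1.0]],
]
toy_logits = [0.0, 0.0, 0.0]  # equal logits give uniform layer weights

toy_fused = weighted_sum(toy_layers, toy_logits)
toy_scores = linear_head(mean_pool(toy_fused),
                         weight=[[1.0, 1.0]], bias=[0.0])
```

Under this sketch, each task would train only its own layer logits and head, so the learned weights also hint at which layers carry the information that task relies on.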
