AS | AI | CL | Oct 29, 2025

Lost in Phonation: Voice Quality Variation as an Evaluation Dimension for Speech Foundation Models

arXiv:2510.25577v1 | 2 citations | h-index: 6
Originality: Incremental advance
AI Analysis

This work addresses the need for more reliable evaluation methods in speech AI that capture paralinguistic nuances, which matters for model robustness in applications such as affective computing. The advance is incremental, as it focuses on one specific under-explored dimension.

The paper tackles the problem of evaluating how sensitive speech foundation models are to voice quality variations such as creaky and breathy voice. It introduces a new parallel dataset and probes the models through open-ended generation and emotion recognition tasks, finding that models behave inconsistently across different phonation inputs.

Recent advances in speech foundation models (SFMs) have enabled the direct processing of spoken language from raw audio, bypassing intermediate textual representations. This capability allows SFMs to be exposed to, and potentially respond to, rich paralinguistic variations embedded in the input speech signal. One under-explored dimension of paralinguistic variation is voice quality, encompassing phonation types such as creaky and breathy voice. These phonation types are known to influence how listeners infer affective state, stance and social meaning in speech. Existing benchmarks for speech understanding largely rely on multiple-choice question answering (MCQA) formats, which are prone to failure and therefore unreliable in capturing the nuanced ways paralinguistic features influence model behaviour. In this paper, we probe SFMs through open-ended generation tasks and speech emotion recognition, evaluating whether model behaviours are consistent across different phonation inputs. We introduce a new parallel dataset featuring synthesized modifications to voice quality, designed to evaluate SFM responses to creaky and breathy voice. Our work provides the first examination of SFM sensitivity to these particular non-lexical aspects of speech perception.
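
To make "consistent across different phonation inputs" concrete, the following is a minimal sketch of one way to score consistency for the emotion recognition setting, assuming each utterance in a parallel set has one predicted label per phonation condition. The function names `consistency_rate` and `flip_counts`, the `"modal"` baseline condition name, and the toy labels are all hypothetical illustrations; this is not the paper's metric, and the numbers are not results from the paper.

```python
from collections import Counter


def consistency_rate(predictions):
    """Fraction of utterances whose predicted label is identical
    across every phonation condition.

    predictions: dict of utterance id -> dict of condition -> label.
    """
    if not predictions:
        raise ValueError("no predictions given")
    consistent = sum(
        1 for by_cond in predictions.values()
        if len(set(by_cond.values())) == 1
    )
    return consistent / len(predictions)


def flip_counts(predictions, baseline="modal"):
    """For each non-baseline condition, count the
    (baseline_label, changed_label) pairs where the label differs."""
    flips = {}
    for by_cond in predictions.values():
        base = by_cond[baseline]
        for cond, label in by_cond.items():
            if cond != baseline and label != base:
                flips.setdefault(cond, Counter())[(base, label)] += 1
    return flips


# Made-up toy labels for illustration only (assumption, not paper data).
toy = {
    "utt1": {"modal": "neutral", "creaky": "neutral", "breathy": "neutral"},
    "utt2": {"modal": "neutral", "creaky": "sad", "breathy": "neutral"},
    "utt3": {"modal": "happy", "creaky": "happy", "breathy": "sad"},
    "utt4": {"modal": "angry", "creaky": "angry", "breathy": "angry"},
}

rate = consistency_rate(toy)
flips = flip_counts(toy)
print(rate)
print({cond: dict(counter) for cond, counter in flips.items()})
```

Because the lexical content is held fixed across the parallel versions, any label change in this setup can be attributed to the phonation manipulation rather than to the words spoken.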

