CL · AI · Jun 2, 2025

Echoes of Phonetics: Unveiling Relevant Acoustic Cues for ASR via Feature Attribution

arXiv:2506.02181v1 · h-index: 34 · INTERSPEECH
Originality: Incremental advance
AI Analysis

This work addresses interpretability for ASR researchers, but the advance is incremental: it extends prior studies to a modern model and a phoneme-specific analysis.

The study addresses the unclear acoustic cues used by modern ASR models by applying feature attribution to a Conformer-based system. It finds that the model relies on vowels' full time spans and first two formants, with greater saliency in male speech; that it captures the spectral characteristics of sibilant fricatives better than those of non-sibilants; and that it prioritizes the release phase in plosives, especially burst characteristics.

Despite significant advances in ASR, the specific acoustic cues models rely on remain unclear. Prior studies have examined such cues on a limited set of phonemes and outdated models. In this work, we apply a feature attribution technique to identify the relevant acoustic cues for a modern Conformer-based ASR system. By analyzing plosives, fricatives, and vowels, we assess how feature attributions align with their acoustic properties in the time and frequency domains, also essential for human speech perception. Our findings show that the ASR model relies on vowels' full time spans, particularly their first two formants, with greater saliency in male speech. It also better captures the spectral characteristics of sibilant fricatives than non-sibilants and prioritizes the release phase in plosives, especially burst characteristics. These insights enhance the interpretability of ASR models and highlight areas for future research to uncover potential gaps in model robustness.
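As a rough illustration of what "aligning feature attributions with acoustic properties in the time and frequency domains" can look like, the sketch below assumes a saliency map is already available as a frames-by-frequency-bins grid of attribution scores. The abstract does not specify the attribution technique or the alignment metric, so the function names (`time_profile`, `frequency_profile`, `share_in_span`), the share-of-mass measure, and the toy data are all hypothetical, not the authors' method.

```python
# Hypothetical sketch: summarize a saliency map along time and frequency.
# The saliency values, shapes, and the "share" metric are assumptions.

def time_profile(saliency):
    """Sum absolute attribution over frequency bins: one value per frame."""
    return [sum(abs(v) for v in frame) for frame in saliency]

def frequency_profile(saliency):
    """Sum absolute attribution over frames: one value per frequency bin."""
    return [sum(abs(v) for v in column) for column in zip(*saliency)]

def share_in_span(profile, start, end):
    """Fraction of total attribution that falls in the half-open span [start, end)."""
    total = sum(profile)
    if total == 0:
        return 0.0
    return sum(profile[start:end]) / total

# Toy saliency map: 10 frames x 8 frequency bins, with all attribution
# placed on frames 3..6 and bins 1..2 (standing in for a vowel's time
# span and a formant band).
saliency = [[0.0] * 8 for _ in range(10)]
for frame in range(3, 7):
    for freq_bin in range(1, 3):
        saliency[frame][freq_bin] = 1.0

t_profile = time_profile(saliency)
f_profile = frequency_profile(saliency)

vowel_share = share_in_span(t_profile, 3, 7)  # share inside the assumed vowel span
band_share = share_in_span(f_profile, 1, 3)   # share inside the assumed formant band
print(vowel_share, band_share)
```

A high share inside a phoneme's time span or an expected frequency band would suggest the model attends to that cue; in practice the span and band boundaries would come from forced alignments and formant estimates, which this sketch does not compute.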

