Phonetic Error Analysis of Raw Waveform Acoustic Models

arXiv:2606.07030
Originality: Incremental advance
AI Analysis

For speech recognition researchers, this work provides a detailed phonetic error analysis of raw waveform models. It shows that their confusion patterns are similar to those of filterbank systems and that transfer learning benefits consonants more than vowels.

The paper analyzes error patterns of raw waveform acoustic models for TIMIT phone recognition. The models achieve 13.9%/15.3% PER on Dev/Test, the best reported for raw waveform models, and transfer learning from WSJ reduces PER to 11.3%/12.3%, surpassing the Filterbank baseline.

We analyse error patterns of raw waveform acoustic models on TIMIT phone recognition beyond the overall phone error rate (PER). PER is decomposed across three broad phonetic class (BPC) categorisations, and confusion matrices are constructed from substitution errors. Our models combine parametric (SincNet, Sinc2Net) or non-parametric CNNs with Bidirectional LSTMs, achieving 13.9%/15.3% PER on Dev/Test, the best reported results for raw waveform models on TIMIT. Transfer learning from WSJ reduces PER to 11.3%/12.3%, surpassing the Filterbank baseline. Per-BPC analysis reveals that BLSTM layers benefit transition-dependent classes most, while WSJ transfer learning improves consonants roughly three times more than vowels. Confusion patterns are consistent across raw waveform and Filterbank systems, indicating that the dominant confusions reflect inherent phonetic similarities.
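The analysis described above (PER decomposed by broad phonetic class, plus a confusion matrix built from substitution errors) can be illustrated with a minimal sketch. It is not the authors' code. The function names `align` and `analyse`, the toy phone sequences, and the three-class `BPC` mapping are hypothetical and chosen only for illustration. The sketch assumes a plain Levenshtein alignment with unit costs, and it assumes that insertions count towards overall PER but are not attributed to any class, since they have no reference phone. The abstract does not say which convention the paper uses.

```python
from collections import Counter


def align(ref, hyp):
    """Levenshtein-align two phone sequences; return (op, ref_phone, hyp_phone) tuples."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = edit distance between ref[:i] and hyp[:j] (unit costs, an assumption)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Backtrace from the bottom-right corner to recover one optimal alignment.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            op = "ok" if ref[i - 1] == hyp[j - 1] else "sub"
            ops.append((op, ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1]))
            j -= 1
    return ops[::-1]


def analyse(utterances, bpc):
    """Return overall PER, per-class error rates, and substitution confusion counts."""
    confusions = Counter()  # (ref_phone, hyp_phone) -> number of substitutions
    ref_count = Counter()   # class -> number of reference phones
    err_count = Counter()   # class -> substitutions + deletions of reference phones
    insertions = 0          # counted in overall PER only (assumed convention)
    for ref, hyp in utterances:
        for op, r, h in align(ref, hyp):
            if op == "ins":
                insertions += 1
                continue
            cls = bpc[r]
            ref_count[cls] += 1
            if op == "sub":
                confusions[(r, h)] += 1
                err_count[cls] += 1
            elif op == "del":
                err_count[cls] += 1
    total_ref = sum(ref_count.values())
    per = (sum(err_count.values()) + insertions) / total_ref
    per_class = {c: err_count[c] / ref_count[c] for c in ref_count}
    return per, per_class, confusions


# Hypothetical toy mapping; the paper uses three BPC categorisations not reproduced here.
BPC = {"b": "plosive", "p": "plosive", "t": "plosive",
       "m": "nasal", "n": "nasal",
       "iy": "vowel", "ae": "vowel"}

# One made-up utterance: (reference phones, recognised phones).
utterances = [(["b", "iy", "t", "m", "ae", "n"], ["p", "iy", "t", "n", "ae"])]

per, per_class, confusions = analyse(utterances, BPC)
print(f"PER = {per:.1%}")
print(per_class)
print(confusions.most_common())
```

Keeping the confusion counts as `(reference, hypothesis)` pairs makes it straightforward to compare the dominant confusions of two systems, for example a raw waveform model and a filterbank model, by comparing their `most_common()` lists.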

