Graph Connectionist Temporal Classification for Phoneme Recognition
This work addresses a specific bottleneck in phoneme recognition: training from noisy, ambiguous supervision. The contribution is incremental, since it builds on existing methods.
The paper tackles the problem of training Automatic Phoneme Recognition systems on ambiguous pronunciations produced by Grapheme-to-Phoneme systems. It adapts Graph Temporal Classification to handle multiple phoneme sequences, which improves phoneme error rates on English and Dutch datasets.
Automatic Phoneme Recognition (APR) systems are often trained using pseudo phoneme-level annotations generated from text through Grapheme-to-Phoneme (G2P) systems. These G2P systems frequently output multiple possible pronunciations per word, but the standard Connectionist Temporal Classification (CTC) loss cannot account for such ambiguity during training. In this work, we adapt Graph Temporal Classification (GTC) to the APR setting. GTC enables training from a graph of alternative phoneme sequences, allowing the model to consider multiple pronunciations per word as valid supervision. Our experiments on English and Dutch data sets show that incorporating multiple pronunciations per word into the training loss consistently improves phoneme error rates compared to a baseline trained with CTC. These results suggest that integrating pronunciation variation into the loss function is a promising strategy for training APR systems from noisy G2P-based supervision.
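To make the idea of treating several pronunciations as valid supervision concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: it assumes all alternatives are weighted equally, enumerates every combination of per-word pronunciations, scores each with the standard CTC forward algorithm, and sums the resulting probabilities. A graph-based loss such as GTC would be expected to compute a comparable sum over a pronunciation graph without enumerating the sequences. The names `ctc_log_prob`, `multi_pronunciation_loss`, and `word_prons`, the use of integer phoneme ids, and the choice of index 0 for the blank symbol are all illustrative assumptions.

```python
import itertools
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of log-values."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def ctc_log_prob(log_probs, labels, blank=0):
    """Log-probability of one label sequence under the CTC forward algorithm.

    log_probs: list of T frames, each a list of per-symbol log-probabilities.
    labels:    list of integer phoneme ids (hypothetical encoding).
    """
    # Interleave blanks: [blank, l1, blank, l2, ..., blank]
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    size = len(ext)

    alpha = [-math.inf] * size
    alpha[0] = log_probs[0][ext[0]]
    if size > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new = [-math.inf] * size
        for s in range(size):
            cands = [alpha[s]]
            if s >= 1:
                cands.append(alpha[s - 1])
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[s - 2])
            new[s] = logsumexp(cands) + log_probs[t][ext[s]]
        alpha = new

    tails = [alpha[-1]]
    if size > 1:
        tails.append(alpha[-2])
    return logsumexp(tails)


def multi_pronunciation_loss(log_probs, word_prons, blank=0):
    """Negative log of the total probability of all pronunciation variants.

    word_prons: one entry per word; each entry is a list of alternative
                pronunciations, each a list of phoneme ids.
    Brute-force enumeration, exponential in the number of words, so this
    is for illustration only.
    """
    # A set avoids double-counting combinations that yield the same sequence.
    sequences = {
        tuple(itertools.chain.from_iterable(combo))
        for combo in itertools.product(*word_prons)
    }
    scores = [ctc_log_prob(log_probs, list(seq), blank) for seq in sequences]
    return -logsumexp(scores)
```

For example, with one word that has two candidate pronunciations, `multi_pronunciation_loss(log_probs, [[[1], [2]]])` is never larger than the plain CTC loss `-ctc_log_prob(log_probs, [1])` for either single pronunciation, because the alternatives add probability mass rather than compete as wrong targets.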