CL, LG, SD, AS | Apr 4, 2024

Transducers with Pronunciation-aware Embeddings for Automatic Speech Recognition

arXiv:2404.04295v1 | h-index: 16 | Has Code | ICASSP
Originality: Incremental advance
AI Analysis

This work addresses error propagation in automatic speech recognition for languages like Mandarin and Korean, offering an incremental improvement over conventional Transducers.

The paper proposes Transducers with Pronunciation-aware Embeddings (PET), whose decoder embeddings incorporate shared components for tokens with the same or similar pronunciations. On Mandarin Chinese and Korean datasets, PET models consistently improve recognition accuracy over conventional Transducers. They also mitigate error chain reactions by reducing the likelihood of further errors after an initial one.
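The shared-component idea can be illustrated with a small sketch. It assumes, for illustration only, that a token's embedding is the sum of a token-specific vector and a vector shared by every token with the same pronunciation key; the summary does not state how the paper combines the components. The class name, the pinyin-style keys, and the toy vocabulary are all hypothetical, and the random vectors stand in for parameters that would be trained.

```python
import random


class PronunciationAwareEmbedding:
    """Toy embedding: a token-specific vector plus a vector shared by homophones.

    Assumption: the two components are combined by summation.
    """

    def __init__(self, token_to_pron, dim, seed=0):
        rng = random.Random(seed)
        self.token_to_pron = dict(token_to_pron)
        self.dim = dim
        # One independent vector per token, as in a conventional embedding table.
        self.token_vecs = {
            tok: [rng.gauss(0.0, 1.0) for _ in range(dim)]
            for tok in self.token_to_pron
        }
        # One vector per pronunciation key, shared by all tokens with that key.
        self.pron_vecs = {
            pron: [rng.gauss(0.0, 1.0) for _ in range(dim)]
            for pron in sorted(set(self.token_to_pron.values()))
        }

    def shared(self, token):
        """Return the pronunciation-level vector used by this token."""
        return self.pron_vecs[self.token_to_pron[token]]

    def embed(self, token):
        """Return token-specific vector + shared pronunciation vector."""
        own = self.token_vecs[token]
        return [a + b for a, b in zip(own, self.shared(token))]


# Hypothetical toy vocabulary: Mandarin homophones mapped to pinyin-style keys.
vocab = {"他": "ta1", "她": "ta1", "它": "ta1", "是": "shi4", "事": "shi4"}
emb = PronunciationAwareEmbedding(vocab, dim=4)

# Homophones reuse the very same shared vector object...
print(emb.shared("他") is emb.shared("她"))
# ...but their full embeddings still differ through the token-specific part.
print(emb.embed("他") != emb.embed("她"))
```

In a trained model, an update to the shared vector would affect every homophone at once, which is the intuition behind sharing components across tokens that sound alike.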

This paper proposes Transducers with Pronunciation-aware Embeddings (PET). Unlike conventional Transducers where the decoder embeddings for different tokens are trained independently, the PET model's decoder embedding incorporates shared components for text tokens with the same or similar pronunciations. With experiments conducted in multiple datasets in Mandarin Chinese and Korean, we show that PET models consistently improve speech recognition accuracy compared to conventional Transducers. Our investigation also uncovers a phenomenon that we call error chain reactions. Instead of recognition errors being evenly spread throughout an utterance, they tend to group together, with subsequent errors often following earlier ones. Our analysis shows that PET models effectively mitigate this issue by substantially reducing the likelihood of the model generating additional errors following a prior one. Our implementation will be open-sourced with the NeMo toolkit.
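One way to quantify the clustering of errors described above is to compare how often a token is wrong when the previous token was wrong against how often it is wrong when the previous token was right. The sketch below is an assumed metric, not necessarily the one used in the paper. The function name is hypothetical, the per-token error flags would come from aligning hypotheses with references (not shown), and the example flags are made up for illustration.

```python
def conditional_error_rates(error_flags_per_utterance):
    """Return (P(error | previous error), P(error | previous correct)).

    Each utterance is a list of 0/1 flags, one per token, where 1 marks
    a recognition error at that position.
    """
    after_error = [0, 0]    # [errors, total] for positions following an error
    after_correct = [0, 0]  # [errors, total] for positions following a correct token
    for flags in error_flags_per_utterance:
        for prev, cur in zip(flags, flags[1:]):
            bucket = after_error if prev else after_correct
            bucket[0] += int(cur)
            bucket[1] += 1

    def rate(bucket):
        return bucket[0] / bucket[1] if bucket[1] else float("nan")

    return rate(after_error), rate(after_correct)


# Made-up flags for two utterances, with errors grouped together.
flags = [
    [0, 0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0, 0],
]
p_after_error, p_after_correct = conditional_error_rates(flags)
print(p_after_error, p_after_correct)
```

A large gap between the two rates indicates that errors tend to follow earlier errors; under this assumed metric, a model that mitigates error chain reactions would show a lower first rate.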

