CL, LG, SD, AS · Feb 10, 2023

PATCorrect: Non-autoregressive Phoneme-augmented Transformer for ASR Error Correction

arXiv:2302.05040v2 · 8 citations · h-index: 19
AI Analysis

The work addresses the need for efficient, low-latency error correction in industrial ASR systems. It represents a strong specific gain rather than a foundational advance.

The paper tackles speech-to-text errors from ASR systems by proposing PATCorrect, a non-autoregressive, phoneme-augmented transformer for error correction. It achieves an overall word error rate reduction (WERR) of 11.62%, compared to 9.46% for methods that use the text modality only, with inference latency in the tens of milliseconds.

Speech-to-text errors made by automatic speech recognition (ASR) systems negatively impact downstream models. Error correction models have recently been developed as a post-processing text editing method for refining ASR outputs. However, efficient models that meet the low latency requirements of industrial-grade production systems have not been well studied. We propose PATCorrect, a novel non-autoregressive (NAR) approach based on multi-modal fusion that leverages representations from both the text and phoneme modalities to reduce word error rate (WER) and perform robustly with varying input transcription quality. We demonstrate that PATCorrect consistently outperforms the state-of-the-art NAR method on an English corpus across different upstream ASR systems, with an overall 11.62% WER reduction (WERR) compared to the 9.46% WERR achieved by other methods using the text modality only. In addition, its inference latency is in the tens of milliseconds, making it well suited to systems with low latency requirements.
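To make the WER and WERR figures above concrete, the following is a minimal sketch of how such metrics are commonly computed. It assumes WER is the word-level edit distance divided by the reference length, and that WERR is the relative WER reduction expressed as a percentage. The function names and the toy sentences are hypothetical and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def wer_reduction(wer_before: float, wer_after: float) -> float:
    """Relative WER reduction (WERR) in percent; assumed definition."""
    if wer_before <= 0:
        raise ValueError("wer_before must be positive")
    return 100.0 * (wer_before - wer_after) / wer_before


# Toy example (hypothetical sentences, not data from the paper).
reference = "turn on the kitchen lights"
asr_output = "turn on the chicken light"
corrected = "turn on the kitchen light"

wer_asr = word_error_rate(reference, asr_output)
wer_corrected = word_error_rate(reference, corrected)
print(wer_asr, wer_corrected, wer_reduction(wer_asr, wer_corrected))
```

In this toy case the ASR output has two substitutions in five reference words (WER 0.4) and the corrected text has one (WER 0.2), which corresponds to a WERR of 50%. The paper's reported 11.62% is an aggregate figure of this kind over its evaluation data.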

