Universal-2-TF: Robust All-Neural Text Formatting for ASR
This work addresses the need for robust text formatting to make ASR output usable in commercial settings, and represents an incremental improvement over traditional rule-based and hybrid methods.
The paper introduces an all-neural text formatting model for ASR systems that handles punctuation restoration, truecasing, and inverse text normalization, and reports superior accuracy, computational efficiency, and perceptual quality.
This paper introduces an all-neural text formatting (TF) model designed for commercial automatic speech recognition (ASR) systems, encompassing punctuation restoration (PR), truecasing, and inverse text normalization (ITN). Unlike traditional rule-based or hybrid approaches, this method leverages a two-stage neural architecture comprising a multi-objective token classifier and a sequence-to-sequence (seq2seq) model. This design minimizes computational costs and reduces hallucinations while ensuring flexibility and robustness across diverse linguistic entities and text domains. Developed as part of the Universal-2 ASR system, the proposed method demonstrates superior performance in TF accuracy, computational efficiency, and perceptual quality, as validated through comprehensive evaluations using both objective and subjective methods. This work underscores the importance of holistic TF models in enhancing ASR usability in practical settings.
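To make the two-stage design more concrete, the following is a minimal sketch of how a token classifier and a seq2seq model might be combined. It is not the authors' implementation. The division of labor is an assumption: here the classifier assigns each token a punctuation label, a casing label, and a flag marking tokens that need rewriting, and the seq2seq stage is applied only to the flagged spans. Restricting generation to short spans is one plausible way such a design could lower compute cost and limit hallucinations. The names `TokenLabels`, `format_text`, and `toy_rewriter` are hypothetical, and the lookup table stands in for a trained seq2seq model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class TokenLabels:
    """Hypothetical per-token outputs of the stage-1 classifier."""
    punct: str = ""        # punctuation appended after the token ("" for none)
    case: str = "lower"    # "lower", "cap", or "upper"
    rewrite: bool = False  # True if the token belongs to a span sent to stage 2


def apply_case(word: str, case: str) -> str:
    """Apply a simple casing label to a single lowercase word."""
    if case == "cap":
        return word.capitalize()
    if case == "upper":
        return word.upper()
    return word


def format_text(
    tokens: Sequence[str],
    labels: Sequence[TokenLabels],
    span_rewriter: Callable[[List[str]], str],
) -> str:
    """Combine stage-1 labels with stage-2 rewrites of flagged spans."""
    pieces = []
    i = 0
    while i < len(tokens):
        if labels[i].rewrite:
            # Collect the maximal run of flagged tokens and rewrite it once.
            j = i
            while j < len(tokens) and labels[j].rewrite:
                j += 1
            piece = span_rewriter(list(tokens[i:j]))
            punct = labels[j - 1].punct  # punctuation of the span's last token
            i = j
        else:
            # Unflagged tokens need only the cheap classifier outputs.
            piece = apply_case(tokens[i], labels[i].case)
            punct = labels[i].punct
            i += 1
        pieces.append(piece + punct)
    return " ".join(pieces)


# Stand-in for a trained seq2seq model (assumption: illustration only).
TOY_TABLE = {("twenty", "five", "dollars"): "$25"}


def toy_rewriter(span: List[str]) -> str:
    """Look up a span rewrite, falling back to the unchanged words."""
    return TOY_TABLE.get(tuple(span), " ".join(span))


tokens = "it costs twenty five dollars said alice".split()
labels = [
    TokenLabels(case="cap"),
    TokenLabels(),
    TokenLabels(rewrite=True),
    TokenLabels(rewrite=True),
    TokenLabels(rewrite=True, punct=","),
    TokenLabels(),
    TokenLabels(case="cap", punct="."),
]
formatted = format_text(tokens, labels, toy_rewriter)
print(formatted)  # It costs $25, said Alice.
```

In this sketch the rewriter is called once for the three-token span and never for the other four tokens, so the cost of the generative stage scales with the amount of text that needs rewriting rather than with the full transcript length.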