Improving Frame-level Classifier for Word Timings with Non-peaky CTC in End-to-End Automatic Speech Recognition
This work addresses the need for accurate word timings in applications like subtitling and pronunciation training, representing an incremental improvement over existing methods.
The paper tackles the problem of improving word timing accuracy in end-to-end automatic speech recognition by enhancing the frame-level classifier with non-peaky CTC and feature combination, achieving up to 95.68% accuracy on an internal Chinese corpus and surpassing a previous E2E approach by 4.80-8.02% across 7 languages.
End-to-end (E2E) systems have shown comparable performance to hybrid systems for automatic speech recognition (ASR). Word timings, as a by-product of ASR, are essential in many applications, especially for subtitling and computer-aided pronunciation training. In this paper, we improve the frame-level classifier for word timings in E2E system by introducing label priors in connectionist temporal classification (CTC) loss, which is adopted from prior works, and combining low-level Mel-scale filter banks with high-level ASR encoder output as input feature. On the internal Chinese corpus, the proposed method achieves 95.68%/94.18% compared to the hybrid system 93.0%/90.22% on the word timing accuracy metrics. It also surpass a previous E2E approach with an absolute increase of 4.80%/8.02% on the metrics on 7 languages. In addition, we further improve word timing accuracy by delaying CTC peaks with frame-wise knowledge distillation, though only experimenting on LibriSpeech.