Word Level Timestamp Generation for Automatic Speech Recognition and Translation
This work addresses the need for precise timestamps in downstream tasks such as speech content retrieval and subtitling, for users of automatic speech recognition (ASR) and automatic speech translation (AST) systems. It is an incremental improvement that integrates timestamp prediction into an existing model.
The paper tackles the problem of generating accurate word-level timestamps for automatic speech recognition and translation. It introduces a data-driven method that trains the Canary model to predict timestamps directly, achieving precision and recall of 80% to 90%, timestamp errors of 20 to 120 ms in ASR and around 200 ms in AST, and minimal WER degradation.
We introduce a data-driven approach for enabling word-level timestamp prediction in the Canary model. Accurate timestamp information is crucial for a variety of downstream tasks such as speech content retrieval and timed subtitles. While traditional hybrid systems and end-to-end (E2E) models may employ external modules for timestamp prediction, our approach eliminates the need for separate alignment mechanisms. Using the NeMo Forced Aligner (NFA) as a teacher model, we generate word-level timestamps and train the Canary model to predict them directly. We introduce a new <|timestamp|> token, enabling the Canary model to predict start and end timestamps for each word. Our method achieves precision and recall between 80% and 90%, with timestamp prediction errors ranging from 20 to 120 ms across four languages and minimal word error rate (WER) degradation. Additionally, we extend our system to automatic speech translation (AST) tasks, achieving timestamp prediction errors of around 200 ms.
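The teacher-student setup described above can be illustrated with a minimal sketch of how word alignments from a forced aligner might be turned into a training target with start and end markers around each word, and how a predicted sequence could be scored against a reference. Everything specific here is an assumption made for illustration and is not taken from the paper: the `<|N|>` marker layout, the 80 ms frame duration, and the helper names `Word`, `to_target`, `from_target`, and `mean_abs_error_ms` are all hypothetical.

```python
import re
from dataclasses import dataclass

# Assumed frame duration in seconds; the source does not state the resolution.
FRAME_SEC = 0.08

# Hypothetical serialized form: "<|start_frame|> word <|end_frame|>".
_PATTERN = re.compile(r"<\|(\d+)\|>\s*(\S+)\s*<\|(\d+)\|>")


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def to_target(words, frame_sec=FRAME_SEC):
    """Serialize teacher alignments into a single target string.

    Times are quantized to the nearest integer frame index.
    """
    parts = []
    for w in words:
        start_frame = round(w.start / frame_sec)
        end_frame = round(w.end / frame_sec)
        parts.append(f"<|{start_frame}|> {w.text} <|{end_frame}|>")
    return " ".join(parts)


def from_target(target, frame_sec=FRAME_SEC):
    """Parse a serialized string back into words with times in seconds."""
    return [
        Word(m.group(2), int(m.group(1)) * frame_sec, int(m.group(3)) * frame_sec)
        for m in _PATTERN.finditer(target)
    ]


def mean_abs_error_ms(reference, predicted):
    """Mean absolute start and end errors in milliseconds.

    Assumes both lists contain the same words in the same order, so words
    are paired by position. Returns a (start_error, end_error) tuple.
    """
    pairs = list(zip(reference, predicted))
    if not pairs:
        raise ValueError("no word pairs to compare")
    start_err = sum(abs(r.start - p.start) for r, p in pairs) / len(pairs)
    end_err = sum(abs(r.end - p.end) for r, p in pairs) / len(pairs)
    return start_err * 1000.0, end_err * 1000.0


if __name__ == "__main__":
    teacher = [Word("hello", 0.16, 0.48), Word("world", 0.56, 0.96)]
    target = to_target(teacher)
    print(target)
    print(mean_abs_error_ms(teacher, from_target(target)))
```

Under these assumptions, quantizing to frame indices limits the timestamp resolution to half a frame, so the chosen frame duration would bound how small the reported errors can get. Pairing words by position is a simplification that only holds when the predicted and reference transcripts contain the same words.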