CL, LG, SD, AS | Jul 18, 2024

A light-weight and efficient punctuation and word casing prediction model for on-device streaming ASR

arXiv:2407.13142v1 | 1 citation | h-index: 1
Originality: Incremental advance
AI Analysis

This work addresses the need for efficient punctuation and casing prediction in on-device streaming ASR systems, offering a practical solution for real-time applications.

The paper tackles punctuation and word casing prediction for on-device streaming ASR by proposing a light-weight CNN-BiLSTM model. On the IWSLT2011 test set, it achieves a 9% relative improvement in overall F1-score over the best non-Transformer model, and results comparable to a representative Transformer-based model at one-fortieth the size and with 2.5x faster inference.

Punctuation and word casing prediction are necessary for automatic speech recognition (ASR). With the popularity of on-device end-to-end streaming ASR systems, on-device punctuation and word casing prediction has become a necessity, yet we found little discussion of it. Since the emergence of the Transformer, Transformer-based models have been explored for this scenario. However, Transformer-based models are too large for on-device ASR systems. In this paper, we propose a light-weight and efficient model that jointly predicts punctuation and word casing in real time. The model is based on a Convolutional Neural Network (CNN) and Bidirectional Long Short-Term Memory (BiLSTM). Experimental results on the IWSLT2011 test set show that the proposed model obtains a 9% relative improvement in overall F1-score over the best non-Transformer model. Compared to a representative Transformer-based model, the proposed model achieves comparable results while being only one-fortieth the size and 2.5 times faster at inference. It is suitable for on-device streaming ASR systems. Our code is publicly available.
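
To make the joint prediction task concrete, the sketch below shows one way per-word punctuation and casing labels could be applied to raw, lowercased ASR output. It is an illustration only, not the authors' code: the abstract does not list the model's output classes, so the label names (`O`, `COMMA`, `PERIOD`, `QUESTION`, `LOWER`, `CAP`, `UPPER`) and the helper `apply_labels` are hypothetical.

```python
from typing import Sequence

# Hypothetical label sets; the abstract does not specify the model's classes.
PUNCT_MARKS = {"O": "", "COMMA": ",", "PERIOD": ".", "QUESTION": "?"}
CASE_OPS = {
    "LOWER": str.lower,       # keep the word in lowercase
    "CAP": str.capitalize,    # uppercase the first letter only
    "UPPER": str.upper,       # uppercase the whole word
}


def apply_labels(
    words: Sequence[str],
    punct_labels: Sequence[str],
    case_labels: Sequence[str],
) -> str:
    """Rebuild readable text from words plus one punctuation and one casing label per word."""
    if not (len(words) == len(punct_labels) == len(case_labels)):
        raise ValueError("expected one punctuation and one casing label per word")
    out = []
    for word, punct, case in zip(words, punct_labels, case_labels):
        # Apply casing first, then attach the punctuation mark that follows the word.
        out.append(CASE_OPS[case](word) + PUNCT_MARKS[punct])
    return " ".join(out)


words = ["how", "are", "you", "i", "am", "fine"]
punct = ["O", "O", "QUESTION", "O", "O", "PERIOD"]
case = ["CAP", "LOWER", "LOWER", "UPPER", "LOWER", "LOWER"]
print(apply_labels(words, punct, case))  # How are you? I am fine.
```

In a streaming setting, a function like this would be called on each chunk of words as the model emits its labels; how the proposed model chunks its input is not described in the abstract.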

