AS CL LG SD

Jul 19, 2021

A baseline model for computationally inexpensive speech recognition for Kazakh using the Coqui STT framework

arXiv:2107.10637v2

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for more efficient speech recognition on mobile devices for Kazakh speakers, but it is incremental as it builds on existing datasets and frameworks without major methodological breakthroughs.

The authors tackled the problem of computationally expensive speech recognition for Kazakh by training a new baseline acoustic model and three language models for the Coqui STT framework. The results are promising but not yet production-level: reaching that accuracy would require further training and parameter sweeping or, alternatively, a restricted vocabulary.

Mobile devices are transforming the way people interact with computers, and speech interfaces to applications are ever more important. Recently published Automatic Speech Recognition (ASR) systems are very accurate, but often require powerful machinery (specialised Graphics Processing Units) for inference, which makes them impractical to run on commodity devices, especially in streaming mode. Impressed by the accuracy of the baseline Kazakh ASR model of Khassanov et al. (2021), but dissatisfied with its inference times when not using a GPU, we trained a new baseline acoustic model (on the same dataset as the aforementioned paper) and three language models for use with the Coqui STT framework. Results look promising, but further epochs of training and parameter sweeping or, alternatively, limiting the vocabulary that the ASR system must support, are needed to reach production-level accuracy.
