Exploration of End-to-End ASR for OpenSTT -- Russian Open Speech-to-Text Dataset
This work addresses Russian speech recognition by comparing end-to-end and hybrid methods on a large open dataset, but it is incremental, as it applies existing techniques to new data.
The paper tackles automatic speech recognition for Russian using the OpenSTT dataset, evaluating end-to-end models against a hybrid system. The best end-to-end model achieves word error rates of 34.8%, 19.1%, and 18.1% on the phone calls, YouTube, and books validation sets, respectively.
This paper presents an exploration of end-to-end automatic speech recognition (ASR) systems for the largest open-source Russian language data set -- OpenSTT. We evaluate different existing end-to-end approaches such as joint CTC/Attention, RNN-Transducer, and Transformer. All of them are compared with a strong hybrid ASR system based on an LF-MMI TDNN-F acoustic model. For the three available validation sets (phone calls, YouTube, and books), our best end-to-end model achieves word error rates (WER) of 34.8%, 19.1%, and 18.1%, respectively. Under the same conditions, the hybrid ASR system demonstrates 33.5%, 20.9%, and 18.6% WER.
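For readers unfamiliar with the metric reported above, the following is a minimal sketch of how WER is commonly computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the paper's own scoring pipeline and text normalization are not described here and may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); tokenization is plain
    whitespace splitting, which is an assumption, not the paper's setup.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of three reference words.
    print(word_error_rate("привет как дела", "привет как дело"))
```

Corpus-level figures such as those in the abstract are usually obtained by summing edit counts over all utterances and dividing by the total number of reference words, rather than by averaging per-utterance rates.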