VAIS ASR: Building a conversational speech recognition system using language model combination
This work addresses the challenge of building robust ASR systems for conversational speech in noisy settings and represents an incremental improvement over existing methods.
The paper tackles the problem of improving automatic speech recognition (ASR) for conversational speech in noisy environments. It combines language models so that the system can leverage both a large amount of written-style text and a small amount of conversational text, and it achieves 4.85% WER on the VLSP 2018 data set and 15.09% WER on the VLSP 2019 data set.
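The summary does not say which combination method the paper uses. One standard way to combine a language model trained on a large written corpus with one trained on a small conversational corpus is linear interpolation, P(w) = λ·P_conv(w) + (1 − λ)·P_write(w), with λ chosen to minimize perplexity on held-out conversational text. The sketch below assumes that technique and uses add-one smoothed unigram models for brevity. The function names (`train_unigram`, `interpolate`, `perplexity`, `tune_weight`) and the toy corpora are hypothetical illustrations, not the authors' code.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Model = Dict[str, float]


def train_unigram(tokens: List[str], vocab: set) -> Model:
    """Add-one smoothed unigram model over a shared vocabulary (hypothetical helper)."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def interpolate(p_a: Model, p_b: Model, lam: float) -> Model:
    """Linear interpolation: P(w) = lam * P_a(w) + (1 - lam) * P_b(w)."""
    return {w: lam * p_a[w] + (1.0 - lam) * p_b[w] for w in p_a}


def perplexity(model: Model, tokens: List[str]) -> float:
    """Perplexity of a token sequence; every token must be in the vocabulary."""
    log_prob = sum(math.log(model[w]) for w in tokens)
    return math.exp(-log_prob / len(tokens))


def tune_weight(p_conv: Model, p_write: Model, dev: List[str],
                steps: int = 10) -> Tuple[float, float]:
    """Grid-search the weight on the conversational model that minimizes dev perplexity."""
    best_lam, best_ppl = 0.0, float("inf")
    for i in range(steps + 1):
        lam = i / steps
        ppl = perplexity(interpolate(p_conv, p_write, lam), dev)
        if ppl < best_ppl:
            best_lam, best_ppl = lam, ppl
    return best_lam, best_ppl


if __name__ == "__main__":
    # Made-up toy corpora standing in for large written text and small conversational text.
    written = "the government announced the new policy today".split()
    conversational = "yeah i think the policy is ok you know".split()
    dev = "yeah the new policy is ok".split()
    vocab = set(written) | set(conversational) | set(dev)
    p_write = train_unigram(written, vocab)
    p_conv = train_unigram(conversational, vocab)
    lam, ppl = tune_weight(p_conv, p_write, dev)
    print(f"weight on conversational LM: {lam:.1f}, dev perplexity: {ppl:.2f}")
```

Because the grid includes λ = 0 and λ = 1, the tuned mixture is never worse on the dev set than either component model alone. A production system would more likely interpolate higher-order n-gram or neural models, but the weight-tuning idea is the same.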
Automatic Speech Recognition (ASR) systems have been evolving quickly and have reached human parity in certain cases. They usually perform well on read, clean speech; however, most available systems struggle when the speaking style is conversational and the environment is noisy. Such problems are not straightforward to tackle because both speech and text data are difficult to collect. In this paper, we attempt to mitigate these problems using language model combination techniques that allow us to utilize both a large amount of written-style text and a small amount of conversational text data. Evaluation on the VLSP 2019 ASR challenge showed that our system achieved 4.85% WER on the VLSP 2018 data set and 15.09% WER on the VLSP 2019 data set.
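The results are reported as word error rate (WER), defined as (S + D + I) / N, where S, D and I are the substitutions, deletions and insertions in the minimum-edit alignment between hypothesis and reference, and N is the number of reference words. A 4.85% WER therefore corresponds to roughly 4.85 word errors per 100 reference words. The sketch below computes WER with a word-level Levenshtein distance; splitting on whitespace is an assumption, and official VLSP scoring may apply its own text normalization. The function name `word_error_rate` is hypothetical.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # d[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
    return d[len(ref)][len(hyp)] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions are counted, WER can exceed 100% when a hypothesis contains many extra words.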