CL SD AS

Jun 15, 2022

Exploring Capabilities of Monolingual Audio Transformers using Large Datasets in Automatic Speech Recognition of Czech

Jan Lehečka, Jan Švec, Aleš Pražák, Josef V. Psutka

arXiv:2206.07627v11.417 citations

h-index: 23

Originality: Synthesis-oriented

AI Analysis

This work addresses automatic speech recognition of Czech, showing incremental improvements obtained by applying an existing method (Wav2Vec 2.0) to large labeled and unlabeled datasets.

The authors tackled automatic speech recognition of Czech by pretraining monolingual audio transformers on more than 80,000 hours of unlabeled speech and fine-tuning them on in-domain data combined with nearly 6,000 hours of out-of-domain transcribed speech. The resulting models compete with state-of-the-art LVCSR systems and perform well as zero-shot learners when no training data are available for the target task.

In this paper, we present our progress in pretraining Czech monolingual audio transformers from a large dataset containing more than 80 thousand hours of unlabeled speech, and subsequently fine-tuning the model on automatic speech recognition tasks using a combination of in-domain data and almost 6 thousand hours of out-of-domain transcribed speech. We present a wide range of experiments with various fine-tuning setups, evaluated on two public datasets (CommonVoice and VoxPopuli) and one extremely challenging dataset from the MALACH project. Our results show that monolingual Wav2Vec 2.0 models are robust ASR systems, which can take advantage of large labeled and unlabeled datasets and successfully compete with state-of-the-art LVCSR systems. Moreover, Wav2Vec models proved to be good zero-shot learners when no training data are available for the target ASR task.
