CL SD AS

May 24, 2022

Adaptive multilingual speech recognition with pretrained models

Ngoc-Quan Pham, Alex Waibel, Jan Niehues

arXiv:2205.12304

Originality: Synthesis-oriented

AI Analysis

This work addresses speech recognition for multiple languages, especially those with limited data, though it appears incremental as it builds on existing pretrained models.

The researchers tackled multilingual speech recognition by combining pretrained audio (wav2vec 2.0) and text (MBART50) models with adaptive weight techniques, achieving a 44% improvement over purely supervised learning on CommonVoice and Europarl datasets.

Multilingual speech recognition with supervised learning has achieved great results, as reflected in recent research. With the development of pretraining methods on audio and text data, it is imperative to transfer knowledge from unsupervised multilingual models to facilitate recognition, especially in the many languages with limited data. Our work investigated the effectiveness of using two pretrained models for two modalities, wav2vec 2.0 for audio and MBART50 for text, together with adaptive weight techniques to massively improve recognition quality on public datasets including CommonVoice and Europarl. Overall, we observed a 44% improvement over purely supervised learning, and more importantly, each technique provides a different reinforcement in different languages. We also explored other possibilities to potentially obtain the best model by slightly adding either depth or relative attention to the architecture.

