Improved Self-Supervised Multilingual Speech Representation Learning Combined with Auxiliary Language Information
This work addresses language interference in self-supervised multilingual speech pre-training, offering incremental improvements for multilingual ASR.
The paper tackles language interference in self-supervised multilingual speech pre-training by using auxiliary language information during pre-training, yielding a 14.3% relative gain over the standard XLSR model and a 19.8% relative gain over a multilingual baseline without pre-training on a 16-language ASR task.
Multilingual end-to-end models have shown great improvement over monolingual systems. With the development of speech pre-training methods, self-supervised multilingual speech representation learning such as XLSR has been successful in improving the performance of multilingual automatic speech recognition (ASR). However, as in supervised learning, multilingual pre-training may also suffer from language interference, which in turn limits the application of multilingual systems. In this paper, we introduce several techniques for improving self-supervised multilingual pre-training by leveraging auxiliary language information: language adversarial training, language embedding, and language adaptive training, all applied during the pre-training stage. We conduct experiments on a multilingual ASR task consisting of 16 languages. Our experimental results demonstrate a 14.3% relative gain over the standard XLSR model and a 19.8% relative gain over the multilingual model without pre-training.
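As a rough illustration of two of the techniques named in the abstract, the following Python sketch shows one common way that a language embedding and language adversarial training can be realized. It is not the paper's implementation. The additive embedding, the gradient reversal formulation of adversarial training, and every name and size used here (lang_table, FEAT_DIM, lam, classifier_grad, and so on) are assumptions made for the example. Only the count of 16 languages comes from the abstract.

```python
import math
import random

random.seed(0)

NUM_LANGS = 16  # the abstract's ASR task covers 16 languages
FEAT_DIM = 4    # hypothetical feature size, chosen only for illustration

# Hypothetical embedding table: one vector per language ID.
lang_table = [[random.gauss(0.0, 0.1) for _ in range(FEAT_DIM)]
              for _ in range(NUM_LANGS)]


def add_language_embedding(frames, lang_id):
    """Language embedding (assumed additive): add the language vector to every frame."""
    vec = lang_table[lang_id]
    return [[x + v for x, v in zip(frame, vec)] for frame in frames]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classifier_grad(utt, weights, lang_id):
    """Gradient of the language classifier's cross-entropy loss w.r.t. its input."""
    logits = [sum(w * x for w, x in zip(row, utt)) for row in weights]
    probs = softmax(logits)
    # d(loss)/d(logits) for softmax cross-entropy is probs minus the one-hot target.
    dlogits = [p - (1.0 if k == lang_id else 0.0) for k, p in enumerate(probs)]
    return [sum(weights[k][d] * dlogits[k] for k in range(NUM_LANGS))
            for d in range(FEAT_DIM)]


def grad_reverse(grad, lam=1.0):
    """Gradient reversal: identity in the forward pass, sign-flipped gradient backward."""
    return [-lam * g for g in grad]


# Toy utterance: 5 frames of encoder features (random stand-ins).
frames = [[random.gauss(0.0, 1.0) for _ in range(FEAT_DIM)] for _ in range(5)]
conditioned = add_language_embedding(frames, lang_id=3)

# Adversarial branch: mean-pool the frames, classify the language, and
# reverse the gradient that would flow back into the encoder.
weights = [[random.gauss(0.0, 0.1) for _ in range(FEAT_DIM)]
           for _ in range(NUM_LANGS)]
utt = [sum(col) / len(frames) for col in zip(*frames)]
g_cls = classifier_grad(utt, weights, lang_id=3)
g_enc = grad_reverse(g_cls, lam=0.5)
```

In this sketch the classifier would still be updated to predict the language, while the encoder receives the reversed gradient g_enc and is therefore pushed toward features that carry less language identity. The embedding path does the opposite, giving the model explicit language information. How the paper combines or weights these signals is not stated in the abstract.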