Soft Language Clustering for Multilingual Model Pre-training
This work addresses performance gaps in multilingual pre-trained language models for languages with limited data or distant typology, and represents an incremental improvement over existing methods.
The paper tackles the problem that multilingual pre-trained language models underperform when the target language is typologically distant from the source languages or when pre-training data is limited. It proposes XLM-P, which contextually retrieves prompts to guide the conditional encoding of each instance. The method yields consistent performance improvements on XTREME tasks and substantial advantages for low-resource languages.
Multilingual pre-trained language models have demonstrated impressive (zero-shot) cross-lingual transfer abilities; however, their performance is hindered when the target language is typologically distant from the source languages or when pre-training data is limited in size. In this paper, we propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally. Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods. On the tasks of XTREME, including text classification, sequence labeling, question answering, and sentence retrieval, both base- and large-size language models pre-trained with our proposed method exhibit consistent performance improvement. Furthermore, it provides substantial advantages for low-resource languages in unsupervised sentence retrieval and for target languages that differ greatly from the source language in cross-lingual transfer.
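
To make "contextually retrieves prompts" concrete, the following is a minimal sketch of one way such retrieval could work. It is not the authors' implementation. The abstract does not specify the mechanism, so everything here is an assumption for illustration: the prompt pool, the mean-pooled query, the scaled dot-product similarity, the softmax weighting, and the hypothetical names `retrieve_prompt` and `encode_input_with_prompt`. The idea shown is that each instance softly mixes a small set of shared prompt vectors, and the mixed prompt is prepended to the token embeddings before encoding.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(token_embeddings):
    """Average the token vectors into a single query vector for the instance."""
    dim = len(token_embeddings[0])
    count = len(token_embeddings)
    return [sum(tok[d] for tok in token_embeddings) / count for d in range(dim)]


def retrieve_prompt(token_embeddings, prompt_pool):
    """Soft retrieval: weight every pool entry by its similarity to the instance.

    Assumption (not stated in the abstract): each pool entry acts as both key
    and value, and similarity is a scaled dot product.
    """
    query = mean_pool(token_embeddings)
    scale = math.sqrt(len(query))
    scores = [sum(q * p for q, p in zip(query, entry)) / scale for entry in prompt_pool]
    weights = softmax(scores)
    dim = len(query)
    prompt = [sum(w * entry[d] for w, entry in zip(weights, prompt_pool)) for d in range(dim)]
    return prompt, weights


def encode_input_with_prompt(token_embeddings, prompt_pool):
    """Prepend the retrieved prompt so a downstream encoder can condition on it."""
    prompt, weights = retrieve_prompt(token_embeddings, prompt_pool)
    return [prompt] + token_embeddings, weights


# Toy data: random vectors stand in for learned prompts and token embeddings.
random.seed(0)
DIM, POOL_SIZE, SEQ_LEN = 8, 4, 5
pool = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(POOL_SIZE)]
tokens = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(SEQ_LEN)]
sequence, weights = encode_input_with_prompt(tokens, pool)
print(len(sequence), [round(w, 3) for w in weights])
```

In this sketch the retrieval weights depend only on the instance, not on a language ID, so inputs from related languages can end up sharing prompt mixtures. That is one plausible reading of the "soft language clustering" in the title, offered here as an interpretation rather than a confirmed detail of XLM-P.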