LinguAlchemy: Fusing Typological and Geographical Elements for Unseen Language Generalization
This work addresses unseen language generalization, a problem that matters to NLP practitioners working with low-resource languages, though it appears incremental because it builds on existing regularization techniques and adds linguistic features to them.
The paper tackles the failure of pretrained language models on unseen languages by introducing LinguAlchemy, a regularization method that uses typological, geographical, and phylogenetic features to align model representations with linguistic information. Compared to fully finetuned models, the method yields significant performance improvements on low-resource languages in tasks such as intent classification and semantic relatedness.
Pretrained language models (PLMs) have become remarkably adept at task and language generalization. Nonetheless, they often fail when faced with unseen languages. In this work, we present LinguAlchemy, a regularization method that incorporates various linguistic information, covering typological, geographical, and phylogenetic features, to align PLM representations with the corresponding linguistic information for each language. Compared to fully finetuned models, LinguAlchemy significantly improves the performance of mBERT and XLM-R on low-resource languages in multiple downstream tasks such as intent classification, news classification, and semantic relatedness, and it displays a high degree of unseen language generalization. We further introduce AlchemyScale and AlchemyTune, extensions of LinguAlchemy that adjust the linguistic regularization weights automatically, alleviating the need for hyperparameter search.
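To make the idea of a linguistic regularizer concrete, the following is a minimal sketch, not the authors' implementation. It assumes the alignment term is a mean squared error between a linearly projected sentence representation and a per-language feature vector, and that this term is added to the task loss with a fixed weight. The function names, the `reg_weight` parameter, the linear projection, and the random toy data are all hypothetical, and NumPy stands in for an actual PLM and training framework.

```python
import numpy as np


def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy over a batch (numerically stable log-softmax)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


def linguistic_alignment_loss(sentence_repr, projection, lang_vectors):
    """Assumed alignment term: MSE between projected representations
    and the linguistic feature vector of each example's language."""
    projected = sentence_repr @ projection
    return np.mean((projected - lang_vectors) ** 2)


def regularized_loss(logits, labels, sentence_repr, projection,
                     lang_vectors, reg_weight=1.0):
    """Task loss plus weighted alignment term (reg_weight is hypothetical)."""
    task = softmax_cross_entropy(logits, labels)
    align = linguistic_alignment_loss(sentence_repr, projection, lang_vectors)
    return task + reg_weight * align, task, align


# Toy data standing in for PLM outputs and linguistic feature vectors.
rng = np.random.default_rng(0)
batch, hidden, n_features, n_classes = 4, 8, 5, 3
logits = rng.normal(size=(batch, n_classes))
labels = np.array([0, 2, 1, 0])
sentence_repr = rng.normal(size=(batch, hidden))       # e.g. pooled encoder output
projection = rng.normal(size=(hidden, n_features))     # learned in practice
lang_vectors = rng.integers(0, 2, size=(batch, n_features)).astype(float)

total, task, align = regularized_loss(
    logits, labels, sentence_repr, projection, lang_vectors, reg_weight=0.5
)
print(f"task={task:.4f} align={align:.4f} total={total:.4f}")
```

In this sketch the weight on the alignment term is a fixed hyperparameter. The abstract describes AlchemyScale and AlchemyTune as adjusting such weights automatically, but how they do so is not stated here, so that part is left out.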