CL, Nov 10, 2023

DeMuX: Data-efficient Multilingual Learning

Simran Khanuja, Srinivas Gowriraj, Lucio Dery, Graham Neubig

arXiv:2311.06379v19.631 citations, h-index: 91, Has Code

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of data-efficient multilingual learning for NLP practitioners, though the advance is incremental because it builds on existing active learning methods.

The paper tackles the problem of fine-tuning pre-trained multilingual models with limited labeled data by introducing DeMuX, a framework that selects the most informative data points to label from unlabeled multilingual data, achieving gains of up to 8-11 F1 points in low-budget settings.

We consider the task of optimally fine-tuning pre-trained multilingual models, given small amounts of unlabelled target data and an annotation budget. In this paper, we introduce DeMuX, a framework that prescribes the exact data points to label from vast amounts of unlabelled multilingual data, having unknown degrees of overlap with the target set. Unlike most prior work, our end-to-end framework is language-agnostic, accounts for model representations, and supports multilingual target configurations. Our active learning strategies rely upon distance and uncertainty measures to select task-specific neighbors that are most informative to label, given a model. DeMuX outperforms strong baselines in 84% of the test cases, in the zero-shot setting of disjoint source and target language sets (including multilingual target pools), across three models and four tasks. Notably, in low-budget settings (5-100 examples), we observe gains of up to 8-11 F1 points for token-level tasks, and 2-5 F1 points for complex tasks. Our code is released here: https://github.com/simran-khanuja/demux.
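The abstract describes selection driven by distance and uncertainty measures without giving formulas, so the following is only an illustrative sketch of that general idea and not the paper's actual strategies. It assumes each unlabelled source example has an embedding and a predicted class distribution from the model, and it scores examples by closeness to the nearest target embedding combined with predictive entropy. The function names, the min-max normalization, and the mixing weight `alpha` are all hypothetical choices made for this example.

```python
import math

def entropy(probs):
    # Shannon entropy of one predicted class distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def min_max(values):
    # Rescale a list of numbers to [0, 1]; constant lists map to zeros.
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def select_to_label(source_emb, target_emb, source_probs, budget, alpha=0.5):
    """Return indices of `budget` source examples to annotate.

    Hypothetical scoring (an assumption, not taken from the paper):
    alpha * closeness-to-target + (1 - alpha) * model uncertainty.
    """
    # Euclidean distance from each source point to its nearest target point.
    nearest = [min(math.dist(s, t) for t in target_emb) for s in source_emb]
    closeness = [1.0 - d for d in min_max(nearest)]
    uncertainty = min_max([entropy(p) for p in source_probs])
    scores = [alpha * c + (1 - alpha) * u
              for c, u in zip(closeness, uncertainty)]
    # Highest score first; ties keep their original order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]

# Toy data: three unlabelled source points, one target point.
source_emb = [[0.0, 0.0], [10.0, 10.0], [0.1, 0.0]]
target_emb = [[0.0, 0.0]]
source_probs = [[0.5, 0.5], [0.5, 0.5], [0.99, 0.01]]
chosen = select_to_label(source_emb, target_emb, source_probs, budget=2)
```

Setting `alpha=1.0` in this sketch ranks purely by proximity to the target set, while `alpha=0.0` ranks purely by model uncertainty; the released repository should be consulted for the strategies the authors actually use.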
