CL · Feb 20, 2024

Understanding the effects of language-specific class imbalance in multilingual fine-tuning

arXiv:2402.13016v1105 citations · h-index: 6 · Findings

Originality: Incremental advance

AI Analysis

This paper addresses a specific data imbalance issue in multilingual NLP, but the solution is incremental, as it adapts existing class weighting methods.

The study examines language-specific class imbalance in multilingual fine-tuning and shows that it leads to worse performance, greater separation of languages in the latent space, and the promotion of uninformative features. It mitigates these effects by calculating class weights separately for each language.

We study the effect of one type of imbalance often present in real-life multilingual classification datasets: an uneven distribution of labels across languages. We show evidence that fine-tuning a transformer-based Large Language Model (LLM) on a dataset with this imbalance leads to worse performance, a more pronounced separation of languages in the latent space, and the promotion of uninformative features. We modify the traditional class weighting approach to imbalance by calculating class weights separately for each language and show that this helps mitigate those detrimental effects. These results create awareness of the negative effects of language-specific class imbalance in multilingual fine-tuning and the way in which the model learns to rely on the separation of languages to perform the task.
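
The per-language class weighting described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so this example assumes the common "balanced" heuristic (number of examples divided by the number of classes times the class count), applied within each language instead of over the whole dataset. The function names `per_language_class_weights` and `example_weights` are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def per_language_class_weights(labels, languages):
    """Return {language: {label: weight}}.

    Assumption: weights follow the "balanced" heuristic
    n_lang / (n_classes_in_lang * count), computed per language.
    This is an illustrative choice, not necessarily the paper's formula.
    """
    # Count how often each label occurs within each language.
    counts = defaultdict(Counter)
    for label, lang in zip(labels, languages):
        counts[lang][label] += 1

    weights = {}
    for lang, label_counts in counts.items():
        n_lang = sum(label_counts.values())
        n_classes = len(label_counts)
        weights[lang] = {
            label: n_lang / (n_classes * count)
            for label, count in label_counts.items()
        }
    return weights


def example_weights(labels, languages):
    """Map every example to the weight of its (language, label) pair."""
    table = per_language_class_weights(labels, languages)
    return [table[lang][label] for label, lang in zip(labels, languages)]


# Toy data: labels are balanced overall (4 "pos", 4 "neg"), but skewed
# within each language, so dataset-level class weights would all be equal.
labels = ["pos", "pos", "pos", "neg", "pos", "neg", "neg", "neg"]
languages = ["en", "en", "en", "en", "de", "de", "de", "de"]

weights = per_language_class_weights(labels, languages)
```

In this toy dataset, weights computed over the whole dataset would treat both classes identically, while the per-language version upweights the label that is rare within each language. In a fine-tuning loop, the per-example weights could be multiplied into an unreduced per-example loss before averaging; that integration is left out here to keep the sketch dependency-free.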
