Language Models Are Borrowing-Blind: A Multilingual Evaluation of Loanword Identification across 10 Languages
This work addresses language model bias as it bears on minority language preservation, though the contribution is incremental, since it builds on prior evidence of such biases.
The paper investigates whether pretrained language models can identify loanwords across 10 languages and finds that they perform poorly at distinguishing loanwords from native vocabulary, corroborating previous evidence that NLP systems are biased toward loanwords.
Throughout the history of languages, words have been borrowed from one language into another and gradually integrated into the recipient language's lexicon. Speakers can often differentiate these loanwords from native vocabulary, particularly in bilingual communities where a dominant language continuously imposes lexical items on a minority language. This paper investigates whether pretrained language models, including large language models, possess similar capabilities for loanword identification. We evaluate multiple models across 10 languages. Despite explicit instructions and contextual information, our results show that models perform poorly at distinguishing loanwords from native words. These findings corroborate previous evidence that modern NLP systems exhibit a bias toward loanwords over their native equivalents. Our work has implications for developing NLP tools for minority languages and for supporting language preservation in communities under lexical pressure from dominant languages.