CL | Jun 18, 2025

Lost in Variation? Evaluating NLI Performance in Basque and Spanish Geographical Variants

arXiv:2506.15239v2 | 3 citations | h-index: 12 | Proceedings of the 29th Conference on Computational Natural Language Learning
Originality: Incremental advance
AI Analysis

The paper addresses the challenge that linguistic variation poses for NLP applications in low-resource languages such as Basque. It is an incremental advance because its contribution is evaluation and dataset creation rather than a new method.

The paper evaluates how well current language technologies understand Basque and Spanish geographical variants, using Natural Language Inference as the test task. It finds a performance drop on the variants, especially in Basque, with encoder-only models struggling in particular with Western Basque.

In this paper, we evaluate the capacity of current language technologies to understand Basque and Spanish language varieties. We use Natural Language Inference (NLI) as a pivot task and introduce a novel, manually-curated parallel dataset in Basque and Spanish, along with their respective variants. Our empirical analysis of crosslingual and in-context learning experiments using encoder-only and decoder-based Large Language Models (LLMs) shows a performance drop when handling linguistic variation, especially in Basque. Error analysis suggests that this decline is not due to lexical overlap, but rather to the linguistic variation itself. Further ablation experiments indicate that encoder-only models particularly struggle with Western Basque, which aligns with linguistic theory that identifies peripheral dialects (e.g., Western) as more distant from the standard. All data and code are publicly available.
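
The comparison the abstract describes (accuracy on a standard variety versus its variants, plus a lexical-overlap check) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the variety labels "standard" and "western", the whitespace tokenization, the Jaccard overlap measure, and the toy records are all assumptions made for the example, and the toy numbers are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_variety(records):
    """Compute NLI accuracy per language variety.

    records: iterable of (variety, gold_label, predicted_label) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for variety, gold, pred in records:
        total[variety] += 1
        correct[variety] += int(gold == pred)
    return {v: correct[v] / total[v] for v in total}


def drop_from_standard(acc, standard="standard"):
    """Accuracy lost on each variant relative to the standard variety."""
    base = acc[standard]
    return {v: base - a for v, a in acc.items() if v != standard}


def jaccard_overlap(premise, hypothesis):
    """Word-level Jaccard overlap between premise and hypothesis.

    Whitespace tokenization is a simplifying assumption.
    """
    p = set(premise.lower().split())
    h = set(hypothesis.lower().split())
    if not p and not h:
        return 0.0
    return len(p & h) / len(p | h)


# Hypothetical toy predictions, for illustration only (not paper results).
toy_records = [
    ("standard", "entailment", "entailment"),
    ("standard", "contradiction", "contradiction"),
    ("standard", "neutral", "neutral"),
    ("standard", "neutral", "entailment"),
    ("western", "entailment", "entailment"),
    ("western", "contradiction", "neutral"),
    ("western", "neutral", "neutral"),
    ("western", "neutral", "entailment"),
]

acc = accuracy_by_variety(toy_records)
drops = drop_from_standard(acc)
overlap = jaccard_overlap("a dog runs", "a dog sleeps")
```

If errors on the variants showed no clear relationship with overlap scores like these, that would be consistent with the abstract's suggestion that the decline stems from the linguistic variation itself rather than from lexical overlap.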

