CL · AI · Apr 10, 2024

XNLIeu: a dataset for cross-lingual NLI in Basque

arXiv:2404.06996v1 · 32 citations · h-index: 17 · Has Code · NAACL
Originality: Synthesis-oriented
AI Analysis

This work addresses low-resource language support in NLU for Basque, but it is incremental: it extends an existing benchmark to a new language.

The authors address the lack of cross-lingual Natural Language Inference (NLI) resources for Basque by creating XNLIeu, a dataset built by machine-translating the English XNLI corpus and then manually post-editing the output. They find that the translate-train strategy yields the best results overall, though its advantage shrinks on a dataset built natively in Basque rather than by translation.

XNLI is a popular Natural Language Inference (NLI) benchmark widely used to evaluate cross-lingual Natural Language Understanding (NLU) capabilities across languages. In this paper, we expand XNLI to include Basque, a low-resource language that can greatly benefit from transfer-learning approaches. The new dataset, dubbed XNLIeu, has been developed by first machine-translating the English XNLI corpus into Basque, followed by a manual post-edition step. We have conducted a series of experiments using mono- and multilingual LLMs to assess a) the effect of professional post-edition on the MT system; b) the best cross-lingual strategy for NLI in Basque; and c) whether the choice of the best cross-lingual strategy is influenced by the fact that the dataset is built by translation. The results show that post-edition is necessary and that the translate-train cross-lingual strategy obtains better results overall, although the gain is lower when tested in a dataset that has been built natively from scratch. Our code and datasets are publicly available under open licenses.
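The cross-lingual strategies compared in the abstract differ mainly in which language the model is fine-tuned on and which language it sees at test time. The sketch below encodes the usual definitions of zero-shot, translate-train, and translate-test transfer as a small lookup. It is an illustration only: the `Plan` class, the `plan_for` function, and the strategy labels are hypothetical names and are not taken from the authors' repository, and the paper's exact experimental configurations may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    train_lang: str        # language of the fine-tuning data
    test_lang: str         # language the model sees at inference time
    translate_train: bool  # machine-translate the training set first?
    translate_test: bool   # machine-translate the test set first?


def plan_for(strategy: str, source: str = "en", target: str = "eu") -> Plan:
    """Return the data setup for a cross-lingual strategy (hypothetical helper)."""
    if strategy == "zero-shot":
        # Fine-tune on source-language data, evaluate directly on the target.
        return Plan(source, target, False, False)
    if strategy == "translate-train":
        # Translate the training set into the target language, then fine-tune
        # and evaluate in the target language.
        return Plan(target, target, True, False)
    if strategy == "translate-test":
        # Fine-tune on source-language data, translate the test set into the
        # source language before inference.
        return Plan(source, source, False, True)
    raise ValueError(f"unknown strategy: {strategy!r}")


if __name__ == "__main__":
    for name in ("zero-shot", "translate-train", "translate-test"):
        print(name, plan_for(name))
```

Under these definitions, translate-train is the only setup in which the model is fine-tuned on Basque text, which is why the quality of the translated training data matters for that strategy.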

Code Implementations: 1 repo
