CL · Mar 6, 2024

A Measure for Transparent Comparison of Linguistic Diversity in Multilingual NLP Data Sets

arXiv:2403.03909v2 · 32 citations · h-index: 10 · NAACL-HLT
Originality: Incremental advance
AI Analysis

This work addresses the need for more transparent, feature-aware diversity metrics in multilingual NLP. The advance is incremental because it builds on existing typological approaches.

The paper tackles the problem of measuring linguistic diversity in multilingual NLP datasets. It proposes a method that combines linguistic features with an automatic text-based measure to assess diversity against a reference sample, and it finds that (poly)synthetic languages are missing from almost all of the popular datasets analysed.

Typologically diverse benchmarks are increasingly created to track the progress achieved in multilingual NLP. Linguistic diversity of these data sets is typically measured as the number of languages or language families included in the sample, but such measures do not consider structural properties of the included languages. In this paper, we propose assessing linguistic diversity of a data set against a reference language sample as a means of maximising linguistic diversity in the long run. We represent languages as sets of features and apply a version of the Jaccard index suitable for comparing sets of measures. In addition to the features extracted from typological data bases, we propose an automatic text-based measure, which can be used as a means of overcoming the well-known problem of data sparsity in manually collected features. Our diversity score is interpretable in terms of linguistic features and can identify the types of languages that are not represented in a data set. Using our method, we analyse a range of popular multilingual data sets (UD, Bible100, mBERT, XTREME, XGLUE, XNLI, XCOPA, TyDiQA, XQuAD). In addition to ranking these data sets, we find, for example, that (poly)synthetic languages are missing in almost all of them.
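The comparison against a reference sample described in the abstract can be illustrated with a minimal sketch. It assumes a min/max (weighted) form of the Jaccard index applied to the proportion of languages showing each feature value; the abstract does not spell out the exact variant the authors use, so this is an illustration, not their implementation. The function names, the `morphology` feature, and the toy language samples are all hypothetical.

```python
from collections import Counter


def feature_proportions(languages):
    """Map each (feature, value) pair to the share of languages showing it.

    `languages` maps a language name to a dict of feature -> value.
    """
    counts = Counter()
    for feats in languages.values():
        counts.update(feats.items())
    total = len(languages)
    return {key: n / total for key, n in counts.items()}


def weighted_jaccard(a, b):
    """Min/max Jaccard similarity between two non-negative weight mappings."""
    keys = set(a) | set(b)
    numerator = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    denominator = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    return numerator / denominator if denominator else 1.0


def missing_types(dataset, reference):
    """Feature values present in the reference sample but absent from the data set."""
    return sorted(k for k in reference if k not in dataset)


# Toy, invented samples (not real typological data).
dataset_langs = {
    "lang_a": {"morphology": "analytic"},
    "lang_b": {"morphology": "synthetic"},
}
reference_langs = {
    "lang_a": {"morphology": "analytic"},
    "lang_b": {"morphology": "synthetic"},
    "lang_c": {"morphology": "polysynthetic"},
}

dataset = feature_proportions(dataset_langs)
reference = feature_proportions(reference_langs)

score = weighted_jaccard(dataset, reference)
gaps = missing_types(dataset, reference)
print(f"diversity score vs. reference: {score:.2f}")
print("unrepresented feature values:", gaps)
```

Under these assumptions, a score closer to 1 means the data set's feature distribution is closer to the reference sample, and the list of unrepresented feature values is what makes the score interpretable in terms of language types.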

Code Implementations: 1 repo
