CL AI IR LG Nov 3, 2023

Multi-EuP: The Multilingual European Parliament Dataset for Analysis of Bias in Information Retrieval

Jinrui Yang, Timothy Baldwin, Trevor Cohn

arXiv:2311.01870v1 21.3131 citations h-index: 17 Has Code

Originality: Synthesis-oriented

AI Analysis

This dataset supports researchers studying bias in multilingual information retrieval, though the contribution is incremental because it builds on existing multilingual IR benchmarks.

The authors introduce Multi-EuP, a multilingual dataset of 22K European Parliament documents in 24 languages, for studying fairness in information retrieval. The dataset supports analysis of both language and demographic bias, and the authors demonstrate its use for benchmarking IR systems and for exploring the effects of tokenization.

We present Multi-EuP, a new multilingual benchmark dataset comprising 22K multilingual documents collected from the European Parliament and spanning 24 languages. The dataset is designed to investigate fairness in multilingual information retrieval (IR), supporting analysis of both language and demographic bias in a ranking context. It boasts an authentic multilingual corpus, featuring topics translated into all 24 languages, as well as cross-lingual relevance judgments. Furthermore, it offers rich demographic information associated with its documents, facilitating the study of demographic bias. We report the effectiveness of Multi-EuP for benchmarking both monolingual and multilingual IR. We also conduct a preliminary experiment on language bias caused by the choice of tokenization strategy.
