CL · AI · IR · LG · Nov 3, 2023

Multi-EuP: The Multilingual European Parliament Dataset for Analysis of Bias in Information Retrieval

arXiv:2311.01870v1131 citations · h-index: 17
Originality: Synthesis-oriented
AI Analysis

This dataset supports bias analysis in multilingual information retrieval. Its contribution is incremental, since it builds on existing multilingual IR benchmarks.

The authors introduce Multi-EuP, a multilingual dataset of 22K European Parliament documents in 24 languages, for studying fairness in information retrieval. It supports analysis of language and demographic bias, and the authors demonstrate its use for benchmarking IR systems and for exploring the effects of tokenization.

We present Multi-EuP, a new multilingual benchmark dataset comprising 22K multilingual documents collected from the European Parliament and spanning 24 languages. The dataset is designed to investigate fairness in multilingual information retrieval (IR), supporting analysis of both language and demographic bias in a ranking context. It offers an authentic multilingual corpus, with topics translated into all 24 languages, as well as cross-lingual relevance judgments. It also provides rich demographic information associated with its documents, facilitating the study of demographic bias. We report the effectiveness of Multi-EuP for benchmarking both monolingual and multilingual IR. We also conduct a preliminary experiment on language bias caused by the choice of tokenization strategy.
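To make "language bias in a ranking context" concrete, here is a minimal sketch of one way such bias could be quantified: compare each language's share of the top-k results with its share of the relevant documents. This is an illustration only, not the metric or code used by the Multi-EuP authors. The function names (`language_share`, `exposure_gap`) and the toy data are hypothetical.

```python
from collections import Counter


def language_share(languages):
    """Return each language's fraction of the given list of language codes."""
    total = len(languages)
    if total == 0:
        return {}
    counts = Counter(languages)
    return {lang: n / total for lang, n in counts.items()}


def exposure_gap(ranked_languages, relevant_languages, k):
    """Share of each language in the top-k ranking minus its share of the
    relevant set. Positive values mean the language is over-represented
    in the top-k; negative values mean it is under-represented."""
    top_share = language_share(ranked_languages[:k])
    rel_share = language_share(relevant_languages)
    all_langs = sorted(set(top_share) | set(rel_share))
    return {
        lang: top_share.get(lang, 0.0) - rel_share.get(lang, 0.0)
        for lang in all_langs
    }


# Toy data (hypothetical): language codes of a ranked result list, and of
# the documents judged relevant for the same topic.
ranked = ["en", "en", "en", "de", "en", "fr", "de", "pl"]
relevant = ["en", "de", "fr", "pl"]

gaps = exposure_gap(ranked, relevant, k=4)
for lang, gap in gaps.items():
    print(f"{lang}: {gap:+.2f}")
```

In this toy example the relevant documents are spread evenly across four languages, but English fills three of the top four positions, so its gap is positive while French and Polish have negative gaps. A real analysis would aggregate over many topics and would need the dataset's cross-lingual relevance judgments.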

Code Implementations: 1 repo