CL · AI · Feb 16

BETA-Labeling for Multilingual Dataset Construction in Low-Resource IR

Md. Najib Hasan, Mst. Jannatun Ferdous Rain, Fyad Mohammed, Nazmul Siddique

arXiv:2602.14488v1 · 0.6

Originality: Incremental advance

AI Analysis

This work addresses dataset scarcity in low-resource IR, but it is an incremental contribution: it builds on existing LLM and translation methods and adds specific empirical insights.

This study tackles the problem of constructing multilingual datasets for low-resource information retrieval by developing a BETA-labeling framework that uses multiple LLMs as annotators, and by evaluating cross-lingual dataset reuse via machine translation. It finds substantial variation across languages in semantic preservation, which affects the reliability of such reuse.
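The multi-LLM labeling step summarized above can be illustrated with a small aggregation routine. This is a minimal sketch, not the authors' method: the listing does not give the paper's exact consistency check or agreement rule, so it assumes each LLM annotator is queried more than once per item, annotators whose repeated answers disagree are discarded, and a strict majority of the remaining votes is required, with unresolved items returned as `None` for human review. The function names (`consistent_votes`, `aggregate_labels`, `label_item`), model names, and label strings are all hypothetical.

```python
from collections import Counter
from typing import Optional


def consistent_votes(runs: dict[str, list[str]]) -> list[str]:
    """Keep one vote per annotator whose repeated runs all agree (assumed consistency check)."""
    return [r[0] for r in runs.values() if r and all(x == r[0] for x in r)]


def aggregate_labels(
    labels: list[str], min_agreement: float = 0.5
) -> tuple[Optional[str], float]:
    """Return (majority label, agreement ratio).

    The label is None when no label's share exceeds min_agreement
    (strict majority by default), so the item can go to human review.
    """
    if not labels:
        return None, 0.0
    label, votes = Counter(labels).most_common(1)[0]
    ratio = votes / len(labels)
    if ratio <= min_agreement:
        return None, ratio
    return label, ratio


def label_item(runs: dict[str, list[str]]) -> tuple[Optional[str], float]:
    """Hypothetical end-to-end step: consistency filter, then majority agreement."""
    return aggregate_labels(consistent_votes(runs))


# Hypothetical annotators and labels for one query-passage pair.
runs = {
    "model_a": ["relevant", "relevant"],
    "model_b": ["relevant", "relevant"],
    "model_c": ["irrelevant", "relevant"],  # inconsistent, so dropped
}
print(label_item(runs))  # ('relevant', 1.0)
```

Requiring a strict majority (rather than a plurality) means a tie between two surviving annotators is never auto-labeled; this is one reasonable design choice for routing hard cases to the human evaluation stage.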

Information retrieval (IR) in low-resource languages remains limited by the scarcity of high-quality, task-specific annotated datasets. Manual annotation is expensive and difficult to scale, while using large language models (LLMs) as automated annotators raises concerns about label reliability, bias, and evaluation validity. This work presents a Bangla IR dataset constructed with a BETA-labeling framework that uses multiple LLM annotators from diverse model families. The framework incorporates contextual alignment, consistency checks, and majority agreement, followed by human evaluation to verify label quality. Beyond dataset creation, we examine whether IR datasets from other low-resource languages can be effectively reused through one-hop machine translation. Using LLM-based translation across multiple language pairs, we evaluated meaning preservation and task validity between source and translated datasets. Our experiments reveal substantial variation across languages, reflecting language-dependent biases and inconsistent semantic preservation that directly affect the reliability of cross-lingual dataset reuse. Overall, this study highlights both the potential and the limitations of LLM-assisted dataset creation for low-resource IR. It provides empirical evidence of the risks associated with cross-lingual dataset reuse and offers practical guidance for constructing more reliable benchmarks and evaluation pipelines in low-resource language settings.
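One way to quantify the per-language variation in task validity described in the abstract is to compare relevance labels assigned to the same items before and after translation, using chance-corrected agreement. This is a minimal sketch under stated assumptions, not the paper's evaluation protocol: it assumes each source item and its translation receive one categorical label each, and uses Cohen's kappa (a standard agreement statistic, swapped in here for illustration) per language pair. The function names and language keys are hypothetical.

```python
from collections import Counter


def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa between two equal-length label lists."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independence, from each list's label frequencies.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca.keys() | cb.keys()) / (n * n)
    if expected == 1.0:
        # Both lists use the same single label, so agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


def kappa_by_language(
    pairs: dict[str, tuple[list[str], list[str]]]
) -> dict[str, float]:
    """Map each (hypothetical) language pair to kappa(source labels, translated labels)."""
    return {lang: cohens_kappa(src, tgt) for lang, (src, tgt) in pairs.items()}


# Hypothetical labels: relevance judged on source items vs. their translations.
pairs = {
    "xx->bn": (["rel", "rel", "irr", "irr"], ["rel", "rel", "irr", "irr"]),
    "yy->bn": (["rel", "rel", "irr", "irr"], ["rel", "irr", "rel", "irr"]),
}
print(kappa_by_language(pairs))  # {'xx->bn': 1.0, 'yy->bn': 0.0}
```

A kappa near 1 suggests translation left the task labels intact, while a value near 0 means agreement is no better than chance; comparing these values across language pairs is one simple way to surface the kind of language-dependent variation the study reports.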

