WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset
WIKIR addresses the limited availability of large annotated datasets for academic researchers in information retrieval, which should improve reproducibility and help the field advance. The contribution is incremental in that it builds on existing Wikipedia resources.
The authors tackle the lack of large-scale annotated datasets for deep learning in information retrieval by developing WIKIR, an open-source toolkit that automatically builds such datasets from Wikipedia. They also release two datasets built with it, each containing 78,628 queries and over 3 million (query, relevant document) pairs.
In recent years, deep learning methods have achieved new state-of-the-art results in ad-hoc information retrieval. However, such methods usually require large amounts of annotated data to be effective. Since most standard ad-hoc information retrieval datasets publicly available for academic research (e.g. Robust04, ClueWeb09) have at most 250 annotated queries, recent deep learning models for information retrieval perform poorly on these datasets. These models (e.g. DUET, Conv-KNRM) are instead trained and evaluated on data collected from commercial search engines, which is not publicly available for academic research; this is a problem for reproducibility and for the advancement of research. In this paper, we propose WIKIR: an open-source toolkit to automatically build large-scale English information retrieval datasets based on Wikipedia. WIKIR is publicly available on GitHub. We also provide wikIR78k and wikIRS78k: two large-scale, publicly available datasets that each contain 78,628 queries and 3,060,191 (query, relevant document) pairs.