IR | CL | Dec 17, 2024

AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark

arXiv:2412.13102v4 | 16 citations | h-index: 23 | Has Code | ACL
Originality: Incremental advance
AI Analysis

AIR-Bench provides a cost-effective and efficient evaluation tool for information retrieval models in emerging domains, though the advance is incremental, since it builds on existing benchmark concepts and adds automation.

The paper tackles the limitations of current information retrieval benchmarks by proposing AIR-Bench, an automated benchmark that uses large language models to generate diverse testing data without human intervention, and shows it aligns well with human-labeled data.

Evaluation plays a crucial role in the advancement of information retrieval (IR) models. However, current benchmarks, which are based on predefined domains and human-labeled data, face limitations in addressing evaluation needs for emerging domains both cost-effectively and efficiently. To address this challenge, we propose the Automated Heterogeneous Information Retrieval Benchmark (AIR-Bench). AIR-Bench is distinguished by three key features: 1) Automated. The testing data in AIR-Bench is automatically generated by large language models (LLMs) without human intervention. 2) Heterogeneous. The testing data in AIR-Bench is generated with respect to diverse tasks, domains and languages. 3) Dynamic. The domains and languages covered by AIR-Bench are constantly augmented to provide an increasingly comprehensive evaluation benchmark for community developers. We develop a reliable and robust data generation pipeline to automatically create diverse and high-quality evaluation datasets based on real-world corpora. Our findings demonstrate that the generated testing data in AIR-Bench aligns well with human-labeled testing data, making AIR-Bench a dependable benchmark for evaluating IR models. The resources in AIR-Bench are publicly available at https://github.com/AIR-Bench/AIR-Bench.
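One way to make the claim that generated data "aligns well with human-labeled testing data" concrete is to check whether both kinds of test set rank the same retrieval models in the same order. The sketch below is an illustration only: the abstract does not say which agreement measure the authors use, so Spearman rank correlation is an assumption here, and the function names (`average_ranks`, `spearman`, `benchmark_agreement`) are hypothetical rather than part of the AIR-Bench codebase.

```python
from typing import Dict, List


def average_ranks(values: List[float]) -> List[float]:
    """Rank values from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs: List[float], ys: List[float]) -> float:
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all scores are tied")
    return cov / (var_x * var_y) ** 0.5


def benchmark_agreement(generated: Dict[str, float], human: Dict[str, float]) -> float:
    """Compare per-model scores (e.g. nDCG@10) on the models present in both dicts.

    `generated` holds scores on LLM-generated test data, `human` holds scores
    on human-labeled test data; both map a model name to a single score.
    """
    models = sorted(set(generated) & set(human))
    return spearman([generated[m] for m in models], [human[m] for m in models])
```

A value near 1 would mean the two test sets order the models almost identically, while a value near 0 would mean the generated data says little about how models rank on human-labeled data. Any scores passed in should come from real evaluation runs.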

Code Implementations: 1 repo