CL · Apr 15, 2025

CRAB: A Benchmark for Evaluating Curation of Retrieval-Augmented LLMs in Biomedicine

Hanmeng Zhong, Linqing Chen, Wentao Wu, Weilei Wang

arXiv:2504.12342v2 · 1 citation · h-index: 5 · Has Code · EMNLP

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for reliable evaluation of curation in retrieval-augmented LLMs for biomedical applications, though its contribution is incremental: it focuses on benchmarking rather than on new methods.

The authors tackled the problem of evaluating the curation ability of retrieval-augmented LLMs in biomedicine by introducing the CRAB benchmark, which uses a novel citation-based metric to quantify performance and reveals significant discrepancies among mainstream models.

Recent developments in Retrieval-Augmented Large Language Models (LLMs) have shown great promise in biomedical applications. However, a critical gap persists in reliably evaluating their curation ability, the process by which models select and integrate relevant references while filtering out noise. To address this, we introduce the benchmark for Curation of Retrieval-Augmented LLMs in Biomedicine (CRAB), the first multilingual benchmark tailored for evaluating the biomedical curation of retrieval-augmented LLMs, available in English, French, German and Chinese. By incorporating a novel citation-based evaluation metric, CRAB quantifies the curation performance of retrieval-augmented LLMs in biomedicine. Experimental results reveal significant discrepancies in the curation performance of mainstream LLMs, underscoring the urgent need to improve it in the domain of biomedicine. Our dataset is available at https://huggingface.co/datasets/zhm0/CRAB.
