Measuring Risk of Bias in Biomedical Reports: The RoBBR Benchmark
This work addresses the need for reliable evidence evaluation in biomedical question-answering systems, although it is incremental in that it builds on existing risk-of-bias frameworks.
The authors address the problem of assessing methodological quality in biomedical literature by introducing the RoBBR benchmark. Derived from over 500 studies, it comprises three tasks for evaluating risk of bias and includes a human-validated annotation pipeline. Their analyses show that large language models' reasoning and retrieval capabilities affect how well they assess risk of bias.
Systems that answer questions by reviewing the scientific literature are becoming increasingly feasible. To draw reliable conclusions, these systems should take into account the quality of the evidence available from different studies, placing more weight on studies that use a valid methodology. We present a benchmark for measuring the methodological strength of biomedical papers, drawing on the risk-of-bias framework used for systematic reviews. Derived from over 500 biomedical studies, the three benchmark tasks encompass expert reviewers' judgments of studies' research methodologies, including their assessments of risk of bias within these studies. The benchmark contains a human-validated annotation pipeline for fine-grained alignment of reviewers' judgments with research paper sentences. Our analyses show that large language models' reasoning and retrieval capabilities affect their effectiveness at risk-of-bias assessment. The dataset is available at https://github.com/RoBBR-Benchmark/RoBBR.