ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition
This work addresses how to assess LLMs' ability to assist in scientific research, a question relevant to both researchers and AI developers. The contribution is incremental in that it builds on existing benchmarking efforts.
The authors address the lack of a benchmark for evaluating LLMs in scientific discovery by introducing ResearchBench, a large-scale benchmark built around three sub-tasks: inspiration retrieval, hypothesis composition, and hypothesis ranking. It covers 12 disciplines and is restricted to papers published in 2024 to limit data contamination. Their evaluation shows that LLMs perform well at retrieving inspirations, which the authors take as evidence that LLMs can surface novel knowledge associations and generate innovative hypotheses at scale with minimal human intervention.
Large language models (LLMs) have demonstrated potential in assisting scientific research, yet their ability to discover high-quality research hypotheses remains unexamined due to the lack of a dedicated benchmark. To address this gap, we introduce the first large-scale benchmark for evaluating LLMs with a near-sufficient set of sub-tasks of scientific discovery: inspiration retrieval, hypothesis composition, and hypothesis ranking. We develop an automated framework that extracts critical components - research questions, background surveys, inspirations, and hypotheses - from scientific papers across 12 disciplines, with expert validation confirming its accuracy. To prevent data contamination, we focus exclusively on papers published in 2024, ensuring minimal overlap with LLM pretraining data. Our evaluation reveals that LLMs perform well in retrieving inspirations, an out-of-distribution task, suggesting their ability to surface novel knowledge associations. This positions LLMs as "research hypothesis mines", capable of facilitating automated scientific discovery by generating innovative hypotheses at scale with minimal human intervention.
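To make the task decomposition concrete, here is a minimal Python sketch of what one extracted paper record and an inspiration-retrieval score might look like. The four fields follow the components named in the abstract, but the class name `PaperRecord`, the function `retrieval_hit_ratio`, the top-k hit-ratio metric, and the placeholder data are all assumptions made for illustration. They are not the benchmark's actual schema or scoring method.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PaperRecord:
    """Hypothetical container for the components the abstract says are extracted."""
    research_question: str
    background_survey: str
    inspirations: List[str]  # ground-truth inspirations for this paper
    hypothesis: str
    year: int = 2024  # the benchmark uses only 2024 papers


def retrieval_hit_ratio(record: PaperRecord, ranked_candidates: List[str], k: int) -> float:
    """Fraction of ground-truth inspirations that appear in the top-k candidates.

    This metric is an assumed stand-in, not the paper's reported measure.
    """
    if not record.inspirations:
        return 0.0
    top_k = set(ranked_candidates[:k])
    hits = sum(1 for insp in record.inspirations if insp in top_k)
    return hits / len(record.inspirations)


# Placeholder data, for illustration only.
record = PaperRecord(
    research_question="placeholder question",
    background_survey="placeholder survey",
    inspirations=["paper A", "paper C"],
    hypothesis="placeholder hypothesis",
)
ranked = ["paper B", "paper A", "paper D", "paper C"]
print(retrieval_hit_ratio(record, ranked, k=2))  # 0.5
```

With `k=2`, only "paper A" falls in the top-ranked candidates, so one of the two ground-truth inspirations is recovered. Hypothesis composition and hypothesis ranking would consume the same record, using the `hypothesis` field as the reference.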