CL AI · Jan 22

Generating Literature-Driven Scientific Theories at Scale

Peter Jansen, Peter Clark, Doug Downey, Daniel S. Weld

arXiv:2601.16282v12.62 citations · h-index: 5

Originality: Highly original

AI Analysis

This paper addresses the underexplored area of automated theory building in scientific discovery and offers a scalable approach for researchers in fields such as computational science and AI.

The paper tackles automated generation of scientific theories from large literature corpora. In experiments that use 13.7k source papers to synthesize 2.9k theories, a literature-supported method produces theories that match existing evidence and predict future results better than theories generated from parametric LLM memory.

Contemporary automated scientific discovery has focused on agents for generating scientific experiments, while systems that perform higher-level scientific activities such as theory building remain underexplored. In this work, we formulate the problem of synthesizing theories consisting of qualitative and quantitative laws from large corpora of scientific literature. We study theory generation at scale, using 13.7k source papers to synthesize 2.9k theories, and examine how literature-grounded versus parametric-knowledge generation, and accuracy-focused versus novelty-focused generation objectives, change theory properties. Our experiments show that, compared to using parametric LLM memory for generation, our literature-supported method creates theories that are significantly better both at matching existing evidence and at predicting future results from 4.6k subsequently-written papers.
