Data-driven Discovery with Large Generative Models
The paper addresses the problem of accelerating scientific discovery for researchers by leveraging existing data without new experiments. Its contribution is incremental, since it builds on existing LGM capabilities.
The paper proposes using large generative models (LGMs) such as GPT-4 to automate end-to-end data-driven discovery from existing datasets. It demonstrates a proof-of-concept called DATAVOYAGER that fulfills some of the stated desiderata, while arguing that accurate and reliable systems remain hard to achieve with current LGM capabilities alone.
With the accumulation of data at an unprecedented rate, its potential to fuel scientific discovery is growing exponentially. This position paper urges the Machine Learning (ML) community to exploit the capabilities of large generative models (LGMs) to develop automated systems for end-to-end data-driven discovery -- a paradigm encompassing the search and verification of hypotheses purely from a set of provided datasets, without the need for additional data collection or physical experiments. We first outline several desiderata for an ideal data-driven discovery system. Then, through DATAVOYAGER, a proof-of-concept utilizing GPT-4, we demonstrate how LGMs fulfill several of these desiderata -- a feat previously unattainable -- while also highlighting important limitations in the current system that open up opportunities for novel ML research. We contend that achieving accurate, reliable, and robust end-to-end discovery systems solely through the current capabilities of LGMs is challenging. We instead advocate for fail-proof tool integration, along with active user moderation through feedback mechanisms, to foster data-driven scientific discoveries with efficiency and reproducibility.
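To make the "verification of hypotheses purely from a set of provided datasets" step concrete, the following is a minimal sketch of one deterministic statistical tool that a discovery system could call instead of asking the model to do the arithmetic itself. It is an illustration only and is not taken from DATAVOYAGER: the function names `permutation_test` and `verify_hypothesis`, the row format, and the choice of a permutation test are all assumptions made for this example.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in group means.

    Hypothetical helper for illustration; a fixed seed keeps the
    result reproducible across runs.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_permutations + 1)


def verify_hypothesis(rows, group_key, value_key, alpha=0.05):
    """Check the hypothesis 'value_key differs between the two groups'.

    `rows` is assumed to be a list of dicts, e.g. loaded from a CSV file.
    """
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row[value_key])
    if len(groups) != 2:
        raise ValueError("expected exactly two groups")
    (name_a, values_a), (name_b, values_b) = groups.items()
    p_value = permutation_test(values_a, values_b)
    return {
        "groups": (name_a, name_b),
        "p_value": p_value,
        "supported": p_value < alpha,
    }


# Toy dataset (made up for this example): two clearly separated groups.
rows = [{"arm": "x", "score": v} for v in [1.0, 1.1, 0.9, 1.2, 1.0, 0.8]]
rows += [{"arm": "y", "score": v} for v in [5.0, 5.2, 4.9, 5.1, 5.3, 4.8]]
result = verify_hypothesis(rows, "arm", "score")
print(result)
```

In this sketch the model's role would be limited to proposing the hypothesis and choosing the columns, while the test itself runs in ordinary code whose output a user can inspect and re-run. That split is one possible reading of the tool integration and user moderation the abstract advocates.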