DrSR: LLM based Scientific Equation Discovery with Dual Reasoning from Data and Experience
This work addresses the challenge of scientific equation discovery for researchers in fields like physics, chemistry, biology, and materials science, offering an incremental improvement over existing LLM-based approaches.
The paper tackles symbolic regression, the discovery of interpretable mathematical expressions from data, by proposing DrSR, a framework that combines data-driven insight with reflective learning to enhance robustness and discovery capability. In experiments across interdisciplinary datasets, DrSR substantially improves the valid equation rate and outperforms classical and recent LLM-based methods in accuracy, generalization, and search efficiency.
Symbolic regression is a fundamental tool for discovering interpretable mathematical expressions from data, with broad applications across scientific and engineering domains. Recently, large language models (LLMs) have demonstrated strong performance in this task, leveraging embedded scientific priors and reasoning capabilities to surpass traditional methods. However, existing LLM-based approaches, such as LLM-SR, often over-rely on internal priors, lacking explicit data understanding and systematic reflection during equation generation. To address these limitations, we propose DrSR (Dual Reasoning Symbolic Regression), a framework that combines data-driven insight with reflective learning to enhance both robustness and discovery capability. Specifically, DrSR guides LLMs to analyze structural relationships (e.g., monotonicity, nonlinearity, and correlation) within the data to generate structured descriptions. Simultaneously, it monitors equation performance and establishes a feedback loop to refine subsequent generations. By integrating data understanding and generation reflection in a closed loop, DrSR enables more efficient exploration of the symbolic expression space. Experiments across interdisciplinary datasets in physics, chemistry, biology, and materials science demonstrate that DrSR substantially improves the valid equation rate and consistently outperforms both classical and recent LLM-based methods in terms of accuracy, generalization, and search efficiency. These results underscore its potential for scientific equation discovery.
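The two reasoning paths described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (describe_relationship, reflect), the 0.95 and 0.05 thresholds, and the use of normalized mean squared error as the score are all assumptions made for illustration, and the LLM call that would consume these outputs is omitted. The first function derives a structured description (correlation, monotonicity, a nonlinearity hint) from the data; the second scores candidate equations and keeps short notes that could be fed back into the next generation round.

```python
import numpy as np


def rank(values):
    """Return 0-based ranks of a 1-D array (ties are not averaged)."""
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks


def describe_relationship(x, y, name="x"):
    """Hypothetical data-insight step: summarize how y depends on x."""
    pearson = float(np.corrcoef(x, y)[0, 1])
    spearman = float(np.corrcoef(rank(x), rank(y))[0, 1])
    if spearman > 0.95:  # assumed threshold
        trend = "increasing"
    elif spearman < -0.95:
        trend = "decreasing"
    else:
        trend = "non-monotonic"
    # Rank correlation well above linear correlation hints at nonlinearity.
    nonlinear = abs(spearman) - abs(pearson) > 0.05  # assumed threshold
    return {
        "variable": name,
        "pearson": round(pearson, 3),
        "spearman": round(spearman, 3),
        "trend": trend,
        "nonlinear": nonlinear,
    }


def reflect(candidates, x, y):
    """Hypothetical reflection step: score candidates and keep short notes."""
    notes, best = [], float("inf")
    for label, fn in candidates:
        with np.errstate(all="ignore"):  # silence warnings from invalid math
            pred = fn(x)
        if not np.all(np.isfinite(pred)):
            notes.append(f"{label}: invalid (non-finite output)")
            continue
        nmse = float(np.mean((pred - y) ** 2) / np.var(y))
        verdict = "improved" if nmse < best else "no improvement"
        best = min(best, nmse)
        notes.append(f"{label}: {verdict}, NMSE={nmse:.3g}")
    return notes, best


# Toy data with a known ground truth, y = exp(x).
x = np.linspace(0.1, 5.0, 200)
y = np.exp(x)

summary = describe_relationship(x, y)
candidates = [
    ("30*x - 40", lambda v: 30.0 * v - 40.0),
    ("log(x - 10)", lambda v: np.log(v - 10.0)),  # invalid on this domain
    ("exp(x)", np.exp),
]
notes, best = reflect(candidates, x, y)

print(summary)
for note in notes:
    print(note)
```

In a full system, the dictionary from describe_relationship and the notes from reflect would be serialized into the prompt for the next round of equation generation, which is what closes the loop between data understanding and reflection.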