FePySR: A Neural Feature Extraction Framework for Efficient and Scalable Symbolic Regression
For researchers in scientific discovery needing reliable symbolic regression, FePySR offers a scalable improvement over existing methods, particularly for complex or noisy data.
FePySR is a two-stage symbolic regression framework that first extracts candidate features via a heterogeneous neural network, then performs equation search with PySR. It achieves higher equation recovery rates than state-of-the-art methods, recovering 36 of 75 complex equations and succeeding in 24 of 100 biological ODE tests where PySR recovers none.
A fundamental challenge in symbolic regression (SR) is efficiently recovering complex mathematical expressions from observational data. Although this problem is NP-hard, many expressions of practical interest decompose naturally into combinations of nonlinear feature modules, concentrating structural complexity into a small number of reusable components. Here, we introduce FePySR, a two-stage framework that reduces the SR search space by extracting valid features prior to equation search. FePySR first employs a heterogeneous neural network to extract a set of candidate feature expressions from the observational data, then uses PySR to perform structural optimization within this refined expression space. Across five standard benchmarks, FePySR achieves higher equation recovery rates than state-of-the-art methods. On a set of 75 highly complex synthesized equations, FePySR recovers 36 and yields substantially smaller mean squared errors on the unrecovered cases, with reduced computation time compared to PySR. The first stage of FePySR also performs consistently as the number of selected top features varies and as the level of noise in the observational data increases. Applied to ordinary differential equations governing biological systems, FePySR identifies the governing equations in 24 of 100 tests, whereas PySR recovers none. Taken together, these results indicate that FePySR is a generalizable framework that can enhance SR solvers, enabling the efficient and reliable recovery of symbolic expressions across scientific domains.
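The two-stage idea described above (select a few promising features first, then search only within them) can be illustrated with a small toy sketch. This is not the FePySR implementation: a simple correlation ranking stands in for the heterogeneous neural network, and an ordinary least-squares fit stands in for the PySR equation search. The candidate library `CANDIDATES`, the helpers `select_top_features` and `fit_on_features`, and the target equation are all hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical candidate feature library (an assumption, not taken from the paper).
CANDIDATES = {
    "x0": lambda X: X[:, 0],
    "x1": lambda X: X[:, 1],
    "sin(x0)": lambda X: np.sin(X[:, 0]),
    "exp(x1)": lambda X: np.exp(X[:, 1]),
    "x0*x1": lambda X: X[:, 0] * X[:, 1],
    "x0**2": lambda X: X[:, 0] ** 2,
}


def select_top_features(X, y, k):
    """Stage 1 stand-in: rank candidates by |Pearson correlation| with y.

    FePySR uses a heterogeneous neural network here; this ranking is only
    a placeholder that plays the same role of shrinking the search space.
    """
    scores = {
        name: abs(np.corrcoef(fn(X), y)[0, 1]) for name, fn in CANDIDATES.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


def fit_on_features(X, y, names):
    """Stage 2 stand-in: least-squares fit restricted to the selected features.

    FePySR runs PySR at this point; a linear fit is used here so the
    example needs nothing beyond NumPy.
    """
    F = np.column_stack([CANDIDATES[n](X) for n in names])
    coef, *_ = np.linalg.lstsq(F, y, rcond=None)
    mse = float(np.mean((F @ coef - y) ** 2))
    return dict(zip(names, coef)), mse


# Synthetic, noise-free data from a made-up target: y = 3*sin(x0) + 0.5*exp(x1).
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(500, 2))
y = 3.0 * np.sin(X[:, 0]) + 0.5 * np.exp(X[:, 1])

top = select_top_features(X, y, k=4)   # keep the 4 highest-ranked candidates
coefs, mse = fit_on_features(X, y, top)

print("selected features:", top)
for name, value in coefs.items():
    print(f"{name}: {value:+.4f}")
print("mse:", mse)
```

The parameter `k` plays the role of the "number of selected top features" mentioned in the abstract: the second stage only ever sees the `k` columns that survive the first stage. In an actual pipeline, those columns would presumably be handed to a symbolic regression solver such as PySR rather than to a linear fit.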