Data-Informed Model Complexity Metric for Optimizing Symbolic Regression Models
This work addresses model selection challenges in symbolic regression for machine learning practitioners, offering a pragmatic but incremental improvement over existing methods.
The paper tackles the problem of selecting symbolic regression models that generalize well. It introduces a data-informed complexity metric based on Hessian rank approximation and intrinsic dimensionality estimation, improving generalizability without the bias introduced by user-defined parameters.
Choosing, from a well-fitted evolved population, models that generalize beyond the training data is difficult. We introduce a pragmatic method that estimates model complexity from the Hessian rank and uses it for post-processing selection. Complexity is approximated by averaging the rank of the model output's Hessian across a few points (N=3), which offers efficient and accurate rank estimates. This method aligns model selection with the complexity of the input data, calculated using intrinsic dimensionality (ID) estimators. Using the StackGP system, we develop symbolic regression models for the Penn Machine Learning Benchmark and employ twelve methods from the scikit-dimension library to estimate ID, aligning model expressiveness with dataset ID. Our data-informed complexity metric finds the ideal complexity window, balancing model expressiveness and accuracy and enhancing generalizability without the bias common in methods that rely on user-defined parameters, such as weight selection for parsimony pressure.
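The Hessian-rank averaging described above can be illustrated with a minimal sketch. It is not the StackGP implementation. It assumes the Hessian is taken with respect to the model inputs, approximates it by central finite differences, and computes a numerical rank by Gaussian elimination with a tolerance. The function names (`hessian`, `matrix_rank`, `hessian_rank_complexity`), the step size `h`, the tolerance `tol`, and the three sample points are all hypothetical choices made for illustration.

```python
def hessian(f, x, h=1e-3):
    """Central finite-difference Hessian of a scalar function f at point x.

    Assumption: finite differences stand in for whatever differentiation
    scheme the original system uses.
    """
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            def shifted(di, dj):
                p = list(x)
                p[i] += di * h
                p[j] += dj * h
                return f(p)
            # Four-point stencil; for i == j it reduces to a second
            # difference with step 2h.
            val = (shifted(1, 1) - shifted(1, -1)
                   - shifted(-1, 1) + shifted(-1, -1)) / (4 * h * h)
            H[i][j] = H[j][i] = val
    return H


def matrix_rank(M, tol=1e-4):
    """Numerical rank via Gaussian elimination with partial pivoting.

    Entries with absolute value <= tol are treated as zero (assumed
    threshold, chosen to sit well above finite-difference noise here).
    """
    A = [row[:] for row in M]
    rows, cols = len(A), len(A[0])
    rank = 0
    for c in range(cols):
        if rank == rows:
            break
        pivot = max(range(rank, rows), key=lambda r: abs(A[r][c]))
        if abs(A[pivot][c]) <= tol:
            continue  # no usable pivot in this column
        A[rank], A[pivot] = A[pivot], A[rank]
        for r in range(rank + 1, rows):
            factor = A[r][c] / A[rank][c]
            for k in range(c, cols):
                A[r][k] -= factor * A[rank][k]
        rank += 1
    return rank


def hessian_rank_complexity(f, points):
    """Average Hessian rank of f over the given sample points."""
    ranks = [matrix_rank(hessian(f, p)) for p in points]
    return sum(ranks) / len(ranks)


# Three arbitrary sample points (N=3), fixed for reproducibility.
POINTS = [[0.5, -1.2, 2.0], [1.5, 0.3, -0.7], [-2.0, 1.1, 0.4]]

# Toy "models" over three inputs, for illustration only.
linear = lambda x: 2 * x[0] + 3 * x[1] - x[2]
interaction = lambda x: x[0] * x[1] + x[2]
quadratic = lambda x: x[0] ** 2 + x[1] ** 2 + x[2] ** 2

if __name__ == "__main__":
    for name, model in [("linear", linear),
                        ("interaction", interaction),
                        ("quadratic", quadratic)]:
        print(name, hessian_rank_complexity(model, POINTS))
```

In this sketch a purely linear model scores zero, while models with more interacting or nonlinear inputs score higher. A score of this kind could then be compared against an intrinsic dimensionality estimate of the dataset when choosing among evolved models.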