NE, LG | Jul 31, 2023

Active Learning in Genetic Programming: Guiding Efficient Data Collection for Symbolic Regression

arXiv:2308.00672v1 | 4 citations | h-index: 52
Originality: Incremental advance
AI Analysis

This work addresses data efficiency in symbolic regression and is relevant to both researchers and practitioners. Its contribution is incremental: it adapts existing active learning concepts to genetic programming.

This paper tackles the problem of efficiently selecting training data for symbolic regression using genetic programming by developing an active learning approach that combines uncertainty and diversity metrics. The results show that differential entropy performed best as an uncertainty metric, correlation outperformed Euclidean distance for diversity, and a Pareto optimization method effectively balanced both criteria to guide data selection.
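To make the uncertainty criterion above concrete, here is a minimal sketch of scoring candidate inputs by the differential entropy of an ensemble's predictions. It assumes the predictions at each point are roughly Gaussian, so the entropy reduces to 0.5 * ln(2 * pi * e * variance); the paper may use a different estimator. The names `gaussian_differential_entropy`, `most_uncertain_point`, and the toy ensemble are hypothetical and exist only for illustration.

```python
import math
from statistics import pvariance
from typing import Callable, Sequence

Model = Callable[[float], float]


def gaussian_differential_entropy(predictions: Sequence[float], eps: float = 1e-12) -> float:
    """Differential entropy of a Gaussian fitted to the ensemble predictions.

    Assumption: predictions are approximately normal, so
    h = 0.5 * ln(2 * pi * e * variance).
    """
    var = pvariance(predictions) + eps  # eps avoids log(0) when all models agree
    return 0.5 * math.log(2.0 * math.pi * math.e * var)


def most_uncertain_point(models: Sequence[Model], candidates: Sequence[float]) -> float:
    """Return the candidate input where the ensemble disagrees most."""
    return max(
        candidates,
        key=lambda x: gaussian_differential_entropy([m(x) for m in models]),
    )


# Hypothetical ensemble: three expressions that agree at x = 0 and diverge as x grows.
ensemble = [lambda x: x, lambda x: x ** 2, lambda x: math.sin(x)]
candidates = [0.0, 0.5, 1.0, 2.0, 3.0]
chosen = most_uncertain_point(ensemble, candidates)
```

In this toy setup the chosen point is the one where the three expressions spread furthest apart, which is the kind of input an uncertainty-driven active learner would request a label for next.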

This paper examines various methods of computing uncertainty and diversity for active learning in genetic programming. We found that the model population in genetic programming can be exploited to select informative training data points by using a model ensemble combined with an uncertainty metric. We explored several uncertainty metrics and found that differential entropy performed best. We also compared two data diversity metrics and found that correlation performed better than minimum Euclidean distance, although some drawbacks prevent correlation from being used on all problems. Finally, we combined uncertainty and diversity using a Pareto optimization approach so that both are considered in a balanced way when selecting informative and unique data points for training.
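As a rough illustration of how uncertainty and diversity might be balanced, the sketch below computes a Pareto front over two scores per candidate: an uncertainty value and a diversity value taken as the minimum Euclidean distance to the existing training points. It is not the authors' implementation; the function names, the toy data, and the uncertainty values are all hypothetical, and the paper's own selection rule from the front may differ.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def min_distance(candidate: Point, training: Sequence[Point]) -> float:
    """Diversity score: distance from the candidate to its nearest training point."""
    return min(math.dist(candidate, t) for t in training)


def pareto_front(scores: Sequence[Tuple[float, float]]) -> List[int]:
    """Indices of (uncertainty, diversity) pairs not dominated by any other pair.

    Both objectives are maximized. A pair is dominated if another pair is at
    least as good on both objectives and strictly better on at least one.
    """
    front = []
    for i, (u_i, d_i) in enumerate(scores):
        dominated = any(
            u_j >= u_i and d_j >= d_i and (u_j > u_i or d_j > d_i)
            for j, (u_j, d_j) in enumerate(scores)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


# Hypothetical training set, candidate pool, and ensemble uncertainty scores.
training = [(0.0, 0.0), (1.0, 1.0)]
candidates = [(0.1, 0.1), (3.0, 3.0), (2.0, 0.0), (0.5, 0.5)]
uncertainty = [0.9, 0.2, 0.5, 0.1]

diversity = [min_distance(c, training) for c in candidates]
front = pareto_front(list(zip(uncertainty, diversity)))
```

Here the last candidate is both less uncertain and closer to the training data than the second one, so it drops off the front; the remaining candidates each represent a different trade-off between the two criteria.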

Code Implementations: 1 repo