Bogdan Burlacu

LG
h-index24
13papers
584citations
Novelty39%
AI Score30

13 Papers

CONov 27, 2023
A precise symbolic emulator of the linear matter power spectrum

Deaglan J. Bartlett, Lukas Kammerer, Gabriel Kronberger et al.

Computing the matter power spectrum, $P(k)$, as a function of cosmological parameters can be prohibitively slow in cosmological analyses, hence emulating this calculation is desirable. Previous analytic approximations are insufficiently accurate for modern applications, so black-box, uninterpretable emulators are often used. We utilise an efficient genetic programming based symbolic regression framework to explore the space of potential mathematical expressions which can approximate the power spectrum and $σ_8$. We learn the ratio between an existing low-accuracy fitting function for $P(k)$ and that obtained by solving the Boltzmann equations and thus still incorporate the physics which motivated this earlier approximation. We obtain an analytic approximation to the linear power spectrum with a root mean squared fractional error of 0.2% between $k = 9\times10^{-3} - 9 \, h{\rm \, Mpc^{-1}}$ and across a wide range of cosmological parameters, and we provide physical interpretations for various terms in the expression. Our analytic approximation is 950 times faster to evaluate than camb and 36 times faster than the neural network based matter power spectrum emulator BACCO. We also provide a simple analytic approximation for $σ_8$ with a similar accuracy, with a root mean squared fractional error of just 0.1% when evaluated across the same range of cosmologies. This function is easily invertible to obtain $A_{\rm s}$ as a function of $σ_8$ and the other cosmological parameters, if preferred. It is possible to obtain symbolic approximations to a seemingly complex function at a precision required for current and future cosmological analyses without resorting to deep-learning techniques, thus avoiding their black-box nature and large number of parameters. Our emulator will be usable long after the codes on which numerical approximations are built become outdated.

MTRL-SCIJun 13, 2022
Symbolic Regression in Materials Science: Discovering Interatomic Potentials from Data

Bogdan Burlacu, Michael Kommenda, Gabriel Kronberger et al.

Particle-based modeling of materials at atomic scale plays an important role in the development of new materials and understanding of their properties. The accuracy of particle simulations is determined by interatomic potentials, which allow to calculate the potential energy of an atomic system as a function of atomic coordinates and potentially other properties. First-principles-based ab initio potentials can reach arbitrary levels of accuracy, however their aplicability is limited by their high computational cost. Machine learning (ML) has recently emerged as an effective way to offset the high computational costs of ab initio atomic potentials by replacing expensive models with highly efficient surrogates trained on electronic structure data. Among a plethora of current methods, symbolic regression (SR) is gaining traction as a powerful "white-box" approach for discovering functional forms of interatomic potentials. This contribution discusses the role of symbolic regression in Materials Science (MS) and offers a comprehensive overview of current methodological challenges and state-of-the-art results. A genetic programming-based approach for modeling atomic potentials from raw data (consisting of snapshots of atomic positions and associated potential energy) is presented and empirically validated on ab initio electronic structure data.

NEJul 29, 2021Code
Contemporary Symbolic Regression Methods and their Relative Performance

William La Cava, Patryk Orzechowski, Bogdan Burlacu et al.

Many promising approaches to symbolic regression have been presented in recent years, yet progress in the field continues to suffer from a lack of uniform, robust, and transparent benchmarking standards. In this paper, we address this shortcoming by introducing an open-source, reproducible benchmarking platform for symbolic regression. We assess 14 symbolic regression methods and 7 machine learning methods on a set of 252 diverse regression problems. Our assessment includes both real-world datasets with no known model form as well as ground-truth benchmark problems, including physics equations and systems of ordinary differential equations. For the real-world datasets, we benchmark the ability of each method to learn models with low error and low complexity relative to state-of-the-art machine learning methods. For the synthetic problems, we assess each method's ability to find exact solutions in the presence of varying levels of noise. Under these controlled experiments, we conclude that the best performing methods for real-world regression combine genetic algorithms with parameter estimation and/or semantic search drivers. When tasked with recovering exact equations in the presence of noise, we find that deep learning and genetic algorithm-based approaches perform similarly. We provide a detailed guide to reproducing this experiment and contributing new methods, and encourage other researchers to collaborate with us on a common and living symbolic regression benchmark.

LGMay 6, 2025
Call for Action: towards the next generation of symbolic regression benchmark

Guilherme S. Imai Aldeia, Hengzhe Zhang, Geoffrey Bomarito et al.

Symbolic Regression (SR) is a powerful technique for discovering interpretable mathematical expressions. However, benchmarking SR methods remains challenging due to the diversity of algorithms, datasets, and evaluation criteria. In this work, we present an updated version of SRBench. Our benchmark expands the previous one by nearly doubling the number of evaluated methods, refining evaluation metrics, and using improved visualizations of the results to understand the performances. Additionally, we analyze trade-offs between model complexity, accuracy, and energy consumption. Our results show that no single algorithm dominates across all datasets. We propose a call for action from SR community in maintaining and evolving SRBench as a living benchmark that reflects the state-of-the-art in symbolic regression, by standardizing hyperparameter tuning, execution constraints, and computational resource allocation. We also propose deprecation criteria to maintain the benchmark's relevance and discuss best practices for improving SR algorithms, such as adaptive hyperparameter tuning and energy-efficient implementations.

LGFeb 6, 2024
Multi-View Symbolic Regression

Etienne Russeil, Fabrício Olivetti de França, Konstantin Malanchev et al.

Symbolic regression (SR) searches for analytical expressions representing the relationship between a set of explanatory and response variables. Current SR methods assume a single dataset extracted from a single experiment. Nevertheless, frequently, the researcher is confronted with multiple sets of results obtained from experiments conducted with different setups. Traditional SR methods may fail to find the underlying expression since the parameters of each experiment can be different. In this work we present Multi-View Symbolic Regression (MvSR), which takes into account multiple datasets simultaneously, mimicking experimental environments, and outputs a general parametric solution. This approach fits the evaluated expression to each independent dataset and returns a parametric family of functions f(x; theta) simultaneously capable of accurately fitting all datasets. We demonstrate the effectiveness of MvSR using data generated from known expressions, as well as real-world data from astronomy, chemistry and economy, for which an a priori analytical expression is not available. Results show that MvSR obtains the correct expression more frequently and is robust to hyperparameters change. In real-world data, it is able to grasp the group behavior, recovering known expressions from the literature as well as promising alternatives, thus enabling the use of SR to a large range of experimental scenarios.

LGSep 28, 2021
Cluster Analysis of a Symbolic Regression Search Space

Gabriel Kronberger, Lukas Kammerer, Bogdan Burlacu et al.

In this chapter we take a closer look at the distribution of symbolic regression models generated by genetic programming in the search space. The motivation for this work is to improve the search for well-fitting symbolic regression models by using information about the similarity of models that can be precomputed independently from the target function. For our analysis, we use a restricted grammar for uni-variate symbolic regression models and generate all possible models up to a fixed length limit. We identify unique models and cluster them based on phenotypic as well as genotypic similarity. We find that phenotypic similarity leads to well-defined clusters while genotypic similarity does not produce a clear clustering. By mapping solution candidates visited by GP to the enumerated search space we find that GP initially explores the whole search space and later converges to the subspace of highest quality expressions in a run for a simple benchmark problem.

LGSep 28, 2021
Symbolic Regression by Exhaustive Search: Reducing the Search Space Using Syntactical Constraints and Efficient Semantic Structure Deduplication

Lukas Kammerer, Gabriel Kronberger, Bogdan Burlacu et al.

Symbolic regression is a powerful system identification technique in industrial scenarios where no prior knowledge on model structure is available. Such scenarios often require specific model properties such as interpretability, robustness, trustworthiness and plausibility, that are not easily achievable using standard approaches like genetic programming for symbolic regression. In this chapter we introduce a deterministic symbolic regression algorithm specifically designed to address these issues. The algorithm uses a context-free grammar to produce models that are parameterized by a non-linear least squares local optimization procedure. A finite enumeration of all possible models is guaranteed by structural restrictions as well as a caching mechanism for detecting semantically equivalent solutions. Enumeration order is established via heuristics designed to improve search efficiency. Empirical tests on a comprehensive benchmark suite show that our approach is competitive with genetic programming in many noiseless problems while maintaining desirable properties such as simple, reliable models and reproducibility.

LGSep 1, 2021
Optimization Networks for Integrated Machine Learning

Michael Kommenda, Johannes Karder, Andreas Beham et al.

Optimization networks are a new methodology for holistically solving interrelated problems that have been developed with combinatorial optimization problems in mind. In this contribution we revisit the core principles of optimization networks and demonstrate their suitability for solving machine learning problems. We use feature selection in combination with linear model creation as a benchmark application and compare the results of optimization networks to ordinary least squares with optional elastic net regularization. Based on this example we justify the advantages of optimization networks by adapting the network to solve other machine learning problems. Finally, optimization analysis is presented, where optimal input values of a system have to be found to achieve desired output values. Optimization analysis can be divided into three subproblems: model creation to describe the system, model selection to choose the most appropriate one and parameter optimization to obtain the input values. Therefore, optimization networks are an obvious choice for handling optimization analysis tasks.

LGAug 24, 2021
On the Effectiveness of Genetic Operations in Symbolic Regression

Bogdan Burlacu, Michael Affenzeller, Michael Kommenda

This paper describes a methodology for analyzing the evolutionary dynamics of genetic programming (GP) using genealogical information, diversity measures and information about the fitness variation from parent to offspring. We introduce a new subtree tracing approach for identifying the origins of genes in the structure of individuals, and we show that only a small fraction of ancestor individuals are responsible for the evolvement of the best solutions in the population.

LGJul 22, 2021
Hash-Based Tree Similarity and Simplification in Genetic Programming for Symbolic Regression

Bogdan Burlacu, Lukas Kammerer, Michael Affenzeller et al.

We introduce in this paper a runtime-efficient tree hashing algorithm for the identification of isomorphic subtrees, with two important applications in genetic programming for symbolic regression: fast, online calculation of population diversity and algebraic simplification of symbolic expression trees. Based on this hashing approach, we propose a simple diversity-preservation mechanism with promising results on a collection of symbolic regression benchmark problems.

NEJul 20, 2021
Using Shape Constraints for Improving Symbolic Regression Models

Christian Haider, Fabricio Olivetti de França, Bogdan Burlacu et al.

We describe and analyze algorithms for shape-constrained symbolic regression, which allows the inclusion of prior knowledge about the shape of the regression function. This is relevant in many areas of engineering -- in particular whenever a data-driven model obtained from measurements must have certain properties (e.g. positivity, monotonicity or convexity/concavity). We implement shape constraints using a soft-penalty approach which uses multi-objective algorithms to minimize constraint violations and training error. We use the non-dominated sorting genetic algorithm (NSGA-II) as well as the multi-objective evolutionary algorithm based on decomposition (MOEA/D). We use a set of models from physics textbooks to test the algorithms and compare against earlier results with single-objective algorithms. The results show that all algorithms are able to find models which conform to all shape constraints. Using shape constraints helps to improve extrapolation behavior of the models.

NEMar 29, 2021
Shape-constrained Symbolic Regression -- Improving Extrapolation with Prior Knowledge

Gabriel Kronberger, Fabricio Olivetti de França, Bogdan Burlacu et al.

We investigate the addition of constraints on the function image and its derivatives for the incorporation of prior knowledge in symbolic regression. The approach is called shape-constrained symbolic regression and allows us to enforce e.g. monotonicity of the function over selected inputs. The aim is to find models which conform to expected behaviour and which have improved extrapolation capabilities. We demonstrate the feasibility of the idea and propose and compare two evolutionary algorithms for shape-constrained symbolic regression: i) an extension of tree-based genetic programming which discards infeasible solutions in the selection step, and ii) a two population evolutionary algorithm that separates the feasible from the infeasible solutions. In both algorithms we use interval arithmetic to approximate bounds for models and their partial derivatives. The algorithms are tested on a set of 19 synthetic and four real-world regression problems. Both algorithms are able to identify models which conform to shape constraints which is not the case for the unmodified symbolic regression algorithms. However, the predictive accuracy of models with constraints is worse on the training set and the test set. Shape-constrained polynomial regression produces the best results for the test set but also significantly larger models.

NEFeb 3, 2019
Online Diversity Control in Symbolic Regression via a Fast Hash-based Tree Similarity Measure

Bogdan Burlacu, Michael Affenzeller, Gabriel Kronberger et al.

Diversity represents an important aspect of genetic programming, being directly correlated with search performance. When considered at the genotype level, diversity often requires expensive tree distance measures which have a negative impact on the algorithm's runtime performance. In this work we introduce a fast, hash-based tree distance measure to massively speed-up the calculation of population diversity during the algorithmic run. We combine this measure with the standard GA and the NSGA-II genetic algorithms to steer the search towards higher diversity. We validate the approach on a collection of benchmark problems for symbolic regression where our method consistently outperforms the standard GA as well as NSGA-II configurations with different secondary objectives.