MTRL-SCIJun 13, 2022
Symbolic Regression in Materials Science: Discovering Interatomic Potentials from DataBogdan Burlacu, Michael Kommenda, Gabriel Kronberger et al.
Particle-based modeling of materials at atomic scale plays an important role in the development of new materials and understanding of their properties. The accuracy of particle simulations is determined by interatomic potentials, which allow to calculate the potential energy of an atomic system as a function of atomic coordinates and potentially other properties. First-principles-based ab initio potentials can reach arbitrary levels of accuracy, however their aplicability is limited by their high computational cost. Machine learning (ML) has recently emerged as an effective way to offset the high computational costs of ab initio atomic potentials by replacing expensive models with highly efficient surrogates trained on electronic structure data. Among a plethora of current methods, symbolic regression (SR) is gaining traction as a powerful "white-box" approach for discovering functional forms of interatomic potentials. This contribution discusses the role of symbolic regression in Materials Science (MS) and offers a comprehensive overview of current methodological challenges and state-of-the-art results. A genetic programming-based approach for modeling atomic potentials from raw data (consisting of snapshots of atomic positions and associated potential energy) is presented and empirically validated on ab initio electronic structure data.
LGSep 20, 2022
Symbolic Regression with Fast Function Extraction and Nonlinear Least Squares OptimizationLukas Kammerer, Gabriel Kronberger, Michael Kommenda
Fast Function Extraction (FFX) is a deterministic algorithm for solving symbolic regression problems. We improve the accuracy of FFX by adding parameters to the arguments of nonlinear functions. Instead of only optimizing linear parameters, we optimize these additional nonlinear parameters with separable nonlinear least squared optimization using a variable projection algorithm. Both FFX and our new algorithm is applied on the PennML benchmark suite. We show that the proposed extensions of FFX leads to higher accuracy while providing models of similar length and with only a small increase in runtime on the given data. Our results are compared to a large set of regression methods that were already published for the given benchmark suite.
NEMar 3, 2023
Vectorial Genetic Programming -- Optimizing Segments for Feature ExtractionPhilipp Fleck, Stephan Winkler, Michael Kommenda et al.
Vectorial Genetic Programming (Vec-GP) extends GP by allowing vectors as input features along regular, scalar features, using them by applying arithmetic operations component-wise or aggregating vectors into scalars by some aggregation function. Vec-GP also allows aggregating vectors only over a limited segment of the vector instead of the whole vector, which offers great potential but also introduces new parameters that GP has to optimize. This paper formalizes an optimization problem to analyze different strategies for optimizing a window for aggregation functions. Different strategies are presented, included random and guided sampling, where the latter leverages information from an approximated gradient. Those strategies can be applied as a simple optimization algorithm, which itself ca be applied inside a specialized mutation operator within GP. The presented results indicate, that the different random sampling strategies do not impact the overall algorithm performance significantly, and that the guided strategies suffer from becoming stuck in local optima. However, results also indicate, that there is still potential in discovering more efficient algorithms that could outperform the presented strategies.
NEJul 29, 2021Code
Contemporary Symbolic Regression Methods and their Relative PerformanceWilliam La Cava, Patryk Orzechowski, Bogdan Burlacu et al.
Many promising approaches to symbolic regression have been presented in recent years, yet progress in the field continues to suffer from a lack of uniform, robust, and transparent benchmarking standards. In this paper, we address this shortcoming by introducing an open-source, reproducible benchmarking platform for symbolic regression. We assess 14 symbolic regression methods and 7 machine learning methods on a set of 252 diverse regression problems. Our assessment includes both real-world datasets with no known model form as well as ground-truth benchmark problems, including physics equations and systems of ordinary differential equations. For the real-world datasets, we benchmark the ability of each method to learn models with low error and low complexity relative to state-of-the-art machine learning methods. For the synthetic problems, we assess each method's ability to find exact solutions in the presence of varying levels of noise. Under these controlled experiments, we conclude that the best performing methods for real-world regression combine genetic algorithms with parameter estimation and/or semantic search drivers. When tasked with recovering exact equations in the presence of noise, we find that deep learning and genetic algorithm-based approaches perform similarly. We provide a detailed guide to reproducing this experiment and contributing new methods, and encourage other researchers to collaborate with us on a common and living symbolic regression benchmark.
LGSep 28, 2021
Cluster Analysis of a Symbolic Regression Search SpaceGabriel Kronberger, Lukas Kammerer, Bogdan Burlacu et al.
In this chapter we take a closer look at the distribution of symbolic regression models generated by genetic programming in the search space. The motivation for this work is to improve the search for well-fitting symbolic regression models by using information about the similarity of models that can be precomputed independently from the target function. For our analysis, we use a restricted grammar for uni-variate symbolic regression models and generate all possible models up to a fixed length limit. We identify unique models and cluster them based on phenotypic as well as genotypic similarity. We find that phenotypic similarity leads to well-defined clusters while genotypic similarity does not produce a clear clustering. By mapping solution candidates visited by GP to the enumerated search space we find that GP initially explores the whole search space and later converges to the subspace of highest quality expressions in a run for a simple benchmark problem.
LGSep 28, 2021
Symbolic Regression by Exhaustive Search: Reducing the Search Space Using Syntactical Constraints and Efficient Semantic Structure DeduplicationLukas Kammerer, Gabriel Kronberger, Bogdan Burlacu et al.
Symbolic regression is a powerful system identification technique in industrial scenarios where no prior knowledge on model structure is available. Such scenarios often require specific model properties such as interpretability, robustness, trustworthiness and plausibility, that are not easily achievable using standard approaches like genetic programming for symbolic regression. In this chapter we introduce a deterministic symbolic regression algorithm specifically designed to address these issues. The algorithm uses a context-free grammar to produce models that are parameterized by a non-linear least squares local optimization procedure. A finite enumeration of all possible models is guaranteed by structural restrictions as well as a caching mechanism for detecting semantically equivalent solutions. Enumeration order is established via heuristics designed to improve search efficiency. Empirical tests on a comprehensive benchmark suite show that our approach is competitive with genetic programming in many noiseless problems while maintaining desirable properties such as simple, reliable models and reproducibility.
LGSep 8, 2021
Understanding and Preparing Data of Industrial Processes for Machine Learning ApplicationsPhilipp Fleck, Manfred Kügel, Michael Kommenda
Industrial applications of machine learning face unique challenges due to the nature of raw industry data. Preprocessing and preparing raw industrial data for machine learning applications is a demanding task that often takes more time and work than the actual modeling process itself and poses additional challenges. This paper addresses one of those challenges, specifically, the challenge of missing values due to sensor unavailability at different production units of nonlinear production lines. In cases where only a small proportion of the data is missing, those missing values can often be imputed. In cases of large proportions of missing data, imputing is often not feasible, and removing observations containing missing values is often the only option. This paper presents a technique, that allows to utilize all of the available data without the need of removing large amounts of observations where data is only partially available. We do not only discuss the principal idea of the presented method, but also show different possible implementations that can be applied depending on the data at hand. Finally, we demonstrate the application of the presented method with data from a steel production plant.
LGSep 8, 2021
Preprocessing and Modeling of Radial Fan Data for Health State PredictionFlorian Holzinger, Michael Kommenda
Monitoring critical components of systems is a crucial step towards failure safety. Affordable sensors are available and the industry is in the process of introducing and extending monitoring solutions to improve product quality. Often, no expertise of how much data is required for a certain task (e.g. monitoring) exists. Especially in vital machinery, a trend to exaggerated sensors may be noticed, both in quality and in quantity. This often results in an excessive generation of data, which should be transferred, processed and stored nonetheless. In a previous case study, several sensors have been mounted on a healthy radial fan, which was later artificially damaged. The gathered data was used for modeling (and therefore monitoring) a healthy state. The models were evaluated on a dataset created by using a faulty impeller. This paper focuses on the reduction of this data through downsampling and binning. Different models are created with linear regression and random forest regression and the resulting difference in quality is discussed.
LGSep 1, 2021
Optimization Networks for Integrated Machine LearningMichael Kommenda, Johannes Karder, Andreas Beham et al.
Optimization networks are a new methodology for holistically solving interrelated problems that have been developed with combinatorial optimization problems in mind. In this contribution we revisit the core principles of optimization networks and demonstrate their suitability for solving machine learning problems. We use feature selection in combination with linear model creation as a benchmark application and compare the results of optimization networks to ordinary least squares with optional elastic net regularization. Based on this example we justify the advantages of optimization networks by adapting the network to solve other machine learning problems. Finally, optimization analysis is presented, where optimal input values of a system have to be found to achieve desired output values. Optimization analysis can be divided into three subproblems: model creation to describe the system, model selection to choose the most appropriate one and parameter optimization to obtain the input values. Therefore, optimization networks are an obvious choice for handling optimization analysis tasks.
LGSep 1, 2021
Complexity Measures for Multi-objective Symbolic RegressionMichael Kommenda, Andreas Beham, Michael Affenzeller et al.
Multi-objective symbolic regression has the advantage that while the accuracy of the learned models is maximized, the complexity is automatically adapted and need not be specified a-priori. The result of the optimization is not a single solution anymore, but a whole Pareto-front describing the trade-off between accuracy and complexity. In this contribution we study which complexity measures are most appropriately used in symbolic regression when performing multi- objective optimization with NSGA-II. Furthermore, we present a novel complexity measure that includes semantic information based on the function symbols occurring in the models and test its effects on several benchmark datasets. Results comparing multiple complexity measures are presented in terms of the achieved accuracy and model length to illustrate how the search direction of the algorithm is affected.
LGAug 24, 2021
On the Effectiveness of Genetic Operations in Symbolic RegressionBogdan Burlacu, Michael Affenzeller, Michael Kommenda
This paper describes a methodology for analyzing the evolutionary dynamics of genetic programming (GP) using genealogical information, diversity measures and information about the fitness variation from parent to offspring. We introduce a new subtree tracing approach for identifying the origins of genes in the structure of individuals, and we show that only a small fraction of ancestor individuals are responsible for the evolvement of the best solutions in the population.
LGAug 24, 2021
Data Aggregation for Reducing Training Data in Symbolic RegressionLukas Kammerer, Gabriel Kronberger, Michael Kommenda
The growing volume of data makes the use of computationally intense machine learning techniques such as symbolic regression with genetic programming more and more impractical. This work discusses methods to reduce the training data and thereby also the runtime of genetic programming. The data is aggregated in a preprocessing step before running the actual machine learning algorithm. K-means clustering and data binning is used for data aggregation and compared with random sampling as the simplest data reduction method. We analyze the achieved speed-up in training and the effects on the trained models test accuracy for every method on four real-world data sets. The performance of genetic programming is compared with random forests and linear regression. It is shown, that k-means and random sampling lead to very small loss in test accuracy when the data is reduced down to only 30% of the original data, while the speed-up is proportional to the size of the data set. Binning on the contrary, leads to models with very high test error.
NEAug 3, 2021
Extending a Physics-Based Constitutive Model using Genetic ProgrammingGabriel Kronberger, Evgeniya Kabliman, Johannes Kronsteiner et al.
In material science, models are derived to predict emergent material properties (e.g. elasticity, strength, conductivity) and their relations to processing conditions. A major drawback is the calibration of model parameters that depend on processing conditions. Currently, these parameters must be optimized to fit measured data since their relations to processing conditions (e.g. deformation temperature, strain rate) are not fully understood. We present a new approach that identifies the functional dependency of calibration parameters from processing conditions based on genetic programming. We propose two (explicit and implicit) methods to identify these dependencies and generate short interpretable expressions. The approach is used to extend a physics-based constitutive model for deformation processes. This constitutive model operates with internal material variables such as a dislocation density and contains a number of parameters, among them three calibration parameters. The derived expressions extend the constitutive model and replace the calibration parameters. Thus, interpolation between various processing parameters is enabled. Our results show that the implicit method is computationally more expensive than the explicit approach but also produces significantly better results.
LGJul 19, 2021
Predicting Friction System Performance with Symbolic Regression and Genetic Programming with Factor VariablesGabriel Kronberger, Michael Kommenda, Andreas Promberger et al.
Friction systems are mechanical systems wherein friction is used for force transmission (e.g. mechanical braking systems or automatic gearboxes). For finding optimal and safe design parameters, engineers have to predict friction system performance. This is especially difficult in real-world applications, because it is affected by many parameters. We have used symbolic regression and genetic programming for finding accurate and trustworthy prediction models for this task. However, it is not straight-forward how nominal variables can be included. In particular, a one-hot-encoding is unsatisfactory because genetic programming tends to remove such indicator variables. We have therefore used so-called factor variables for representing nominal variables in symbolic regression models. Our results show that GP is able to produce symbolic regression models for predicting friction performance with predictive accuracy that is comparable to artificial neural networks. The symbolic regression models with factor variables are less complex than models using a one-hot encoding.
LGJul 6, 2021
Identification of Dynamical Systems using Symbolic RegressionGabriel Kronberger, Lukas Kammerer, Michael Kommenda
We describe a method for the identification of models for dynamical systems from observational data. The method is based on the concept of symbolic regression and uses genetic programming to evolve a system of ordinary differential equations (ODE). The novelty is that we add a step of gradient-based optimization of the ODE parameters. For this we calculate the sensitivities of the solution to the initial value problem (IVP) using automatic differentiation. The proposed approach is tested on a set of 19 problem instances taken from the literature which includes datasets from simulated systems as well as datasets captured from mechanical systems. We find that gradient-based optimization of parameters improves predictive accuracy of the models. The best results are obtained when we first fit the individual equations to the numeric differences and then subsequently fine-tune the identified parameter values by fitting the IVP solution to the observed variable values.
NEMar 29, 2021
Shape-constrained Symbolic Regression -- Improving Extrapolation with Prior KnowledgeGabriel Kronberger, Fabricio Olivetti de França, Bogdan Burlacu et al.
We investigate the addition of constraints on the function image and its derivatives for the incorporation of prior knowledge in symbolic regression. The approach is called shape-constrained symbolic regression and allows us to enforce e.g. monotonicity of the function over selected inputs. The aim is to find models which conform to expected behaviour and which have improved extrapolation capabilities. We demonstrate the feasibility of the idea and propose and compare two evolutionary algorithms for shape-constrained symbolic regression: i) an extension of tree-based genetic programming which discards infeasible solutions in the selection step, and ii) a two population evolutionary algorithm that separates the feasible from the infeasible solutions. In both algorithms we use interval arithmetic to approximate bounds for models and their partial derivatives. The algorithms are tested on a set of 19 synthetic and four real-world regression problems. Both algorithms are able to identify models which conform to shape constraints which is not the case for the unmodified symbolic regression algorithms. However, the predictive accuracy of models with constraints is worse on the training set and the test set. Shape-constrained polynomial regression produces the best results for the test set but also significantly larger models.
NEFeb 3, 2019
Online Diversity Control in Symbolic Regression via a Fast Hash-based Tree Similarity MeasureBogdan Burlacu, Michael Affenzeller, Gabriel Kronberger et al.
Diversity represents an important aspect of genetic programming, being directly correlated with search performance. When considered at the genotype level, diversity often requires expensive tree distance measures which have a negative impact on the algorithm's runtime performance. In this work we introduce a fast, hash-based tree distance measure to massively speed-up the calculation of population diversity during the algorithmic run. We combine this measure with the standard GA and the NSGA-II genetic algorithms to steer the search towards higher diversity. We validate the approach on a collection of benchmark problems for symbolic regression where our method consistently outperforms the standard GA as well as NSGA-II configurations with different secondary objectives.
NESep 23, 2013
Data Mining using Unguided Symbolic Regression on a Blast Furnace DatasetMichael Kommenda, Gabriel Kronberger, Christoph Feilmayr et al.
In this paper a data mining approach for variable selection and knowledge extraction from datasets is presented. The approach is based on unguided symbolic regression (every variable present in the dataset is treated as the target variable in multiple regression runs) and a novel variable relevance metric for genetic programming. The relevance of each input variable is calculated and a model approximating the target variable is created. The genetic programming configurations with different target variables are executed multiple times to reduce stochastic effects and the aggregated results are displayed as a variable interaction network. This interaction network highlights important system components and implicit relations between the variables. The whole approach is tested on a blast furnace dataset, because of the complexity of the blast furnace and the many interrelations between the variables. Finally the achieved results are discussed with respect to existing knowledge about the blast furnace process.
NEMay 16, 2013
Evolution of Covariance Functions for Gaussian Process Regression using Genetic ProgrammingGabriel Kronberger, Michael Kommenda
In this contribution we describe an approach to evolve composite covariance functions for Gaussian processes using genetic programming. A critical aspect of Gaussian processes and similar kernel-based models such as SVM is, that the covariance function should be adapted to the modeled data. Frequently, the squared exponential covariance function is used as a default. However, this can lead to a misspecified model, which does not fit the data well. In the proposed approach we use a grammar for the composition of covariance functions and genetic programming to search over the space of sentences that can be derived from the grammar. We tested the proposed approach on synthetic data from two-dimensional test functions, and on the Mauna Loa CO2 time series. The results show, that our approach is feasible, finding covariance functions that perform much better than a default covariance function. For the CO2 data set a composite covariance function is found, that matches the performance of a hand-tuned covariance function.