Michael Affenzeller

17 papers

150 citations

Novelty 36%

AI Score 23

Ranked #180,535 of 201,326 authors (top 90%); #39,572 in LG (top 93%)

17 Papers

LG · Nov 8, 2022 · Code

OutlierDetection.jl: A modular outlier detection ecosystem for the Julia programming language

David Muhr, Michael Affenzeller, Anthony D. Blaom

OutlierDetection.jl is an open-source ecosystem for outlier detection in Julia. It provides a range of high-performance outlier detection algorithms implemented directly in Julia. In contrast to previous packages, our ecosystem enables the development of highly scalable outlier detection algorithms using a high-level programming language. Additionally, it provides a standardized, yet flexible, interface for future outlier detection algorithms and allows for model composition unseen in previous packages. Best practices such as unit testing, continuous integration, and code coverage reporting are enforced across the ecosystem. The most recent version of OutlierDetection.jl is available at https://github.com/OutlierDetectionJL/OutlierDetection.jl.

LG · May 11, 2022

Probability Distribution of Hypervolume Improvement in Bi-objective Bayesian Optimization

Hao Wang, Kaifeng Yang, Michael Affenzeller

Hypervolume improvement (HVI) is commonly employed in multi-objective Bayesian optimization algorithms to define acquisition functions due to its Pareto-compliant property. Rather than focusing on specific statistical moments of HVI, this work aims to provide the exact expression of HVI's probability distribution for bi-objective problems. Considering a bi-variate Gaussian random variable resulting from Gaussian process (GP) modeling, we derive the probability distribution of its hypervolume improvement via a cell partition-based method. Our exact expression is superior in numerical accuracy and computational efficiency compared to the Monte Carlo approximation of HVI's distribution. Utilizing this distribution, we propose a novel acquisition function - $\varepsilon$-probability of hypervolume improvement ($\varepsilon$-PoHVI). Experimentally, we show that on many widely-applied bi-objective test problems, $\varepsilon$-PoHVI significantly outperforms other related acquisition functions, e.g., $\varepsilon$-PoI, and expected hypervolume improvement, when the GP model exhibits a large prediction uncertainty.
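For orientation, the Monte Carlo baseline that the abstract compares against can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact cell-partition method: both objectives are minimized, the function names (`hypervolume_2d`, `hvi`, `mc_prob_hvi`) are hypothetical, and estimating P(HVI > eps) is only our reading of a probability-of-improvement-style criterion; the paper's definition of $\varepsilon$-PoHVI may differ.

```python
import numpy as np


def hypervolume_2d(points, ref):
    """Area dominated by `points` (both objectives minimized) up to `ref`."""
    pts = np.asarray(points, dtype=float).reshape(-1, 2)
    ref = np.asarray(ref, dtype=float)
    # Only points that are strictly better than the reference point contribute.
    pts = pts[np.all(pts < ref, axis=1)]
    if len(pts) == 0:
        return 0.0
    # Sweep in increasing f1; a point adds area only if it lowers the best f2.
    pts = pts[np.argsort(pts[:, 0])]
    hv, best_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        if f2 < best_f2:
            hv += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return hv


def hvi(front, y, ref):
    """Hypervolume improvement of candidate `y` over the current `front`."""
    return hypervolume_2d(np.vstack([front, y]), ref) - hypervolume_2d(front, ref)


def mc_prob_hvi(front, mean, cov, ref, eps=0.0, n_samples=2000, seed=0):
    """Monte Carlo estimate of P(HVI > eps) for a bivariate Gaussian prediction."""
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(mean, cov, size=n_samples)
    improvements = np.array([hvi(front, y, ref) for y in samples])
    return float(np.mean(improvements > eps)), improvements


if __name__ == "__main__":
    front = np.array([[1.0, 3.0], [2.0, 2.0], [3.0, 1.0]])
    ref = (4.0, 4.0)
    prob, draws = mc_prob_hvi(front, [1.5, 1.5], 0.25 * np.eye(2), ref, eps=0.1)
    print(f"P(HVI > 0.1) is roughly {prob:.3f}; mean HVI is roughly {draws.mean():.3f}")
```

The sampled `improvements` array is an empirical stand-in for the HVI distribution; an exact expression such as the one the abstract describes would replace the sampling step.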

AI · Aug 7, 2022

A Parallel Technique for Multi-objective Bayesian Global Optimization: Using a Batch Selection of Probability of Improvement

Kaifeng Yang, Guozhi Dong, Michael Affenzeller

Bayesian global optimization (BGO) is an efficient surrogate-assisted technique for problems involving expensive evaluations. A parallel technique can be used to evaluate the expensive true objective functions in parallel within one iteration and thereby reduce the execution time. An effective and straightforward approach is to design an acquisition function that can evaluate the performance of a batch of multiple solutions, instead of a single point/solution, in one iteration. This paper proposes five alternatives of Probability of Improvement (PoI) with multiple points in a batch (q-PoI) for multi-objective Bayesian global optimization (MOBGO), taking the covariance among multiple points into account. Both exact computational formulas and the Monte Carlo approximation algorithms for all proposed q-PoIs are provided. Based on the distribution of the multiple points relevant to the Pareto-front, the position-dependent behavior of the five q-PoIs is investigated. Moreover, the five q-PoIs are compared with nine other state-of-the-art and recently proposed batch MOBGO algorithms on twenty bi-objective benchmarks. Empirical experiments on a variety of benchmarks are conducted to demonstrate the effectiveness of two greedy q-PoIs ($\text{q-PoI}_{\text{best}}$ and $\text{q-PoI}_{\text{all}}$) on low-dimensional problems and the effectiveness of two explorative q-PoIs ($\text{q-PoI}_{\text{one}}$ and $\text{q-PoI}_{\text{worst}}$) on high-dimensional problems with difficult-to-approximate Pareto front boundaries.

MTRL-SCI · Jun 13, 2022

Symbolic Regression in Materials Science: Discovering Interatomic Potentials from Data

Bogdan Burlacu, Michael Kommenda, Gabriel Kronberger et al.

Particle-based modeling of materials at atomic scale plays an important role in the development of new materials and understanding of their properties. The accuracy of particle simulations is determined by interatomic potentials, which allow the potential energy of an atomic system to be calculated as a function of atomic coordinates and potentially other properties. First-principles-based ab initio potentials can reach arbitrary levels of accuracy; however, their applicability is limited by their high computational cost. Machine learning (ML) has recently emerged as an effective way to offset the high computational costs of ab initio atomic potentials by replacing expensive models with highly efficient surrogates trained on electronic structure data. Among a plethora of current methods, symbolic regression (SR) is gaining traction as a powerful "white-box" approach for discovering functional forms of interatomic potentials. This contribution discusses the role of symbolic regression in Materials Science (MS) and offers a comprehensive overview of current methodological challenges and state-of-the-art results. A genetic programming-based approach for modeling atomic potentials from raw data (consisting of snapshots of atomic positions and associated potential energy) is presented and empirically validated on ab initio electronic structure data.

NE · Mar 3, 2023

Vectorial Genetic Programming -- Optimizing Segments for Feature Extraction

Philipp Fleck, Stephan Winkler, Michael Kommenda et al.

Vectorial Genetic Programming (Vec-GP) extends GP by allowing vectors as input features alongside regular scalar features, using them by applying arithmetic operations component-wise or aggregating vectors into scalars by some aggregation function. Vec-GP also allows aggregating vectors only over a limited segment of the vector instead of the whole vector, which offers great potential but also introduces new parameters that GP has to optimize. This paper formalizes an optimization problem to analyze different strategies for optimizing a window for aggregation functions. Different strategies are presented, including random and guided sampling, where the latter leverages information from an approximated gradient. Those strategies can be applied as a simple optimization algorithm, which itself can be applied inside a specialized mutation operator within GP. The presented results indicate that the different random sampling strategies do not impact the overall algorithm performance significantly, and that the guided strategies suffer from becoming stuck in local optima. However, results also indicate that there is still potential in discovering more efficient algorithms that could outperform the presented strategies.

LG · May 16, 2023

A Probabilistic Transformation of Distance-Based Outliers

David Muhr, Michael Affenzeller, Josef Küng

The scores of distance-based outlier detection methods are difficult to interpret, making it challenging to determine a cut-off threshold between normal and outlier data points without additional context. We describe a generic transformation of distance-based outlier scores into interpretable, probabilistic estimates. The transformation is ranking-stable and increases the contrast between normal and outlier data points. Determining distance relationships between data points is necessary to identify the nearest-neighbor relationships in the data, yet most of the computed distances are typically discarded. We show that the distances to other data points can be used to model distance probability distributions and, subsequently, use the distributions to turn distance-based outlier scores into outlier probabilities. Our experiments show that the probabilistic transformation does not impact detection performance over numerous tabular and image benchmark datasets but results in interpretable outlier scores with increased contrast between normal and outlier samples. Our work generalizes to a wide range of distance-based outlier detection methods, and because existing distance computations are used, it adds no significant computational overhead.
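The general idea of reusing already-computed distances to turn a distance score into a probability can be illustrated with a small sketch. It rests on assumptions that are ours, not the paper's: the outlier score is the distance to the k-th nearest neighbor, and the "distance probability distribution" is simply the empirical CDF of all pairwise distances. The function names are hypothetical, and the paper's actual transformation may be quite different.

```python
import numpy as np


def knn_scores(X, k):
    """Distance to the k-th nearest neighbor, plus the full distance matrix."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the zero self-distance,
    # so column k is the k-th nearest neighbor.
    return np.sort(D, axis=1)[:, k], D


def to_probability(scores, D):
    """Map scores through the empirical CDF of all pairwise distances."""
    n = D.shape[0]
    reference = np.sort(D[np.triu_indices(n, k=1)])
    # Fraction of pairwise distances that are <= each score.
    return np.searchsorted(reference, scores, side="right") / len(reference)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(size=(20, 2)), [[10.0, 10.0]]])  # last row is an outlier
    scores, D = knn_scores(X, k=3)
    probs = to_probability(scores, D)
    print("outlier probability of the last point:", round(float(probs[-1]), 3))
```

Because an empirical CDF is monotone non-decreasing, the mapping never reverses the order of two scores, which is the property the abstract calls ranking-stable. No extra distance computations are needed beyond the matrix already built for the nearest-neighbor search.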

LG · Sep 28, 2021

Cluster Analysis of a Symbolic Regression Search Space

Gabriel Kronberger, Lukas Kammerer, Bogdan Burlacu et al.

In this chapter we take a closer look at the distribution of symbolic regression models generated by genetic programming in the search space. The motivation for this work is to improve the search for well-fitting symbolic regression models by using information about the similarity of models that can be precomputed independently from the target function. For our analysis, we use a restricted grammar for uni-variate symbolic regression models and generate all possible models up to a fixed length limit. We identify unique models and cluster them based on phenotypic as well as genotypic similarity. We find that phenotypic similarity leads to well-defined clusters while genotypic similarity does not produce a clear clustering. By mapping solution candidates visited by GP to the enumerated search space we find that GP initially explores the whole search space and later converges to the subspace of highest quality expressions in a run for a simple benchmark problem.

LG · Sep 28, 2021

Symbolic Regression by Exhaustive Search: Reducing the Search Space Using Syntactical Constraints and Efficient Semantic Structure Deduplication

Lukas Kammerer, Gabriel Kronberger, Bogdan Burlacu et al.

Symbolic regression is a powerful system identification technique in industrial scenarios where no prior knowledge on model structure is available. Such scenarios often require specific model properties such as interpretability, robustness, trustworthiness and plausibility, that are not easily achievable using standard approaches like genetic programming for symbolic regression. In this chapter we introduce a deterministic symbolic regression algorithm specifically designed to address these issues. The algorithm uses a context-free grammar to produce models that are parameterized by a non-linear least squares local optimization procedure. A finite enumeration of all possible models is guaranteed by structural restrictions as well as a caching mechanism for detecting semantically equivalent solutions. Enumeration order is established via heuristics designed to improve search efficiency. Empirical tests on a comprehensive benchmark suite show that our approach is competitive with genetic programming in many noiseless problems while maintaining desirable properties such as simple, reliable models and reproducibility.

LG · Sep 1, 2021

Optimization Networks for Integrated Machine Learning

Michael Kommenda, Johannes Karder, Andreas Beham et al.

Optimization networks are a new methodology for holistically solving interrelated problems that have been developed with combinatorial optimization problems in mind. In this contribution we revisit the core principles of optimization networks and demonstrate their suitability for solving machine learning problems. We use feature selection in combination with linear model creation as a benchmark application and compare the results of optimization networks to ordinary least squares with optional elastic net regularization. Based on this example we justify the advantages of optimization networks by adapting the network to solve other machine learning problems. Finally, optimization analysis is presented, where optimal input values of a system have to be found to achieve desired output values. Optimization analysis can be divided into three subproblems: model creation to describe the system, model selection to choose the most appropriate one and parameter optimization to obtain the input values. Therefore, optimization networks are an obvious choice for handling optimization analysis tasks.

LG · Sep 1, 2021

Complexity Measures for Multi-objective Symbolic Regression

Michael Kommenda, Andreas Beham, Michael Affenzeller et al.

Multi-objective symbolic regression has the advantage that while the accuracy of the learned models is maximized, the complexity is automatically adapted and need not be specified a priori. The result of the optimization is not a single solution anymore, but a whole Pareto-front describing the trade-off between accuracy and complexity. In this contribution we study which complexity measures are most appropriately used in symbolic regression when performing multi-objective optimization with NSGA-II. Furthermore, we present a novel complexity measure that includes semantic information based on the function symbols occurring in the models and test its effects on several benchmark datasets. Results comparing multiple complexity measures are presented in terms of the achieved accuracy and model length to illustrate how the search direction of the algorithm is affected.

LG · Aug 24, 2021

On the Effectiveness of Genetic Operations in Symbolic Regression

Bogdan Burlacu, Michael Affenzeller, Michael Kommenda

This paper describes a methodology for analyzing the evolutionary dynamics of genetic programming (GP) using genealogical information, diversity measures and information about the fitness variation from parent to offspring. We introduce a new subtree tracing approach for identifying the origins of genes in the structure of individuals, and we show that only a small fraction of ancestor individuals are responsible for the evolution of the best solutions in the population.

LG · Aug 6, 2021

Concept Drift Detection with Variable Interaction Networks

Jan Zenisek, Gabriel Kronberger, Josef Wolfartsberger et al.

The current development of today's production industry towards seamless sensor-based monitoring is paving the way for concepts such as Predictive Maintenance. By this means, the condition of plants and products in future production lines will be continuously analyzed with the objective to predict any kind of breakdown and trigger preventive actions proactively. Such ambitious predictions are commonly performed with support of machine learning algorithms. In this work, we utilize these algorithms to model complex systems, such as production plants, by focusing on their variable interactions. The core of this contribution is a sliding-window-based algorithm, designed to detect changes of the identified interactions, which might indicate incipient malfunctions in the context of a monitored production plant. Besides a detailed description of the algorithm, we present results from experiments with a synthetic dynamical system, simulating stable and drifting system behavior.
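The sliding-window idea can be sketched in a few lines. This is a deliberately simplified stand-in, not the paper's algorithm: instead of a variable interaction network derived from machine learning models, the sketch assumes that pairwise Pearson correlation is an adequate proxy for variable interaction, and it compares each window against a fixed reference window. All function names and the threshold of 0.3 are hypothetical.

```python
import numpy as np


def interaction_matrix(window):
    """Proxy for variable interactions: the Pearson correlation matrix."""
    return np.corrcoef(window, rowvar=False)


def drift_scores(data, window_size, step):
    """Largest change in any pairwise interaction, relative to the first window."""
    reference = interaction_matrix(data[:window_size])
    scores = []
    for start in range(window_size, len(data) - window_size + 1, step):
        current = interaction_matrix(data[start:start + window_size])
        scores.append((start, float(np.max(np.abs(current - reference)))))
    return scores


def make_stream(n=400, change_at=200, seed=0):
    """Synthetic system: x2 follows x1 until `change_at`, then becomes independent."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    noise = rng.normal(size=n)
    x2 = np.where(np.arange(n) < change_at, x1 + 0.1 * noise, noise)
    return np.column_stack([x1, x2])


if __name__ == "__main__":
    for start, score in drift_scores(make_stream(), window_size=50, step=50):
        flag = "DRIFT" if score > 0.3 else "ok"
        print(f"window starting at {start:3d}: score={score:.3f} {flag}")
```

In this synthetic stream the interaction between the two variables disappears at sample 200, so windows from that point on show a large score while earlier windows stay close to zero.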

LG · Jul 22, 2021

Hash-Based Tree Similarity and Simplification in Genetic Programming for Symbolic Regression

Bogdan Burlacu, Lukas Kammerer, Michael Affenzeller et al.

We introduce in this paper a runtime-efficient tree hashing algorithm for the identification of isomorphic subtrees, with two important applications in genetic programming for symbolic regression: fast, online calculation of population diversity and algebraic simplification of symbolic expression trees. Based on this hashing approach, we propose a simple diversity-preservation mechanism with promising results on a collection of symbolic regression benchmark problems.
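A bottom-up hash over expression trees shows how isomorphic subtrees can be identified and how a diversity-style similarity can be derived from the hashes. The sketch below is an illustration under our own assumptions rather than the authors' implementation: trees are nested tuples of the form `(symbol, child, ...)`, child hashes are sorted for commutative symbols so that argument order does not matter, SHA-1 stands in for whatever hash function the paper uses, and similarity is a Sørensen-Dice coefficient over the multisets of subtree hashes. All names are hypothetical.

```python
import hashlib
from collections import Counter

COMMUTATIVE = {"+", "*"}


def subtree_hashes(tree, out=None):
    """Return (root_hash, hashes of every subtree), computed bottom-up."""
    if out is None:
        out = []
    symbol, *children = tree
    child_hashes = [subtree_hashes(child, out)[0] for child in children]
    if symbol in COMMUTATIVE:
        # Sorting makes x + y and y + x hash to the same value.
        child_hashes.sort()
    payload = symbol + "(" + ",".join(child_hashes) + ")"
    digest = hashlib.sha1(payload.encode("utf-8")).hexdigest()
    out.append(digest)
    return digest, out


def similarity(tree_a, tree_b):
    """Sorensen-Dice coefficient over the multisets of subtree hashes."""
    hashes_a = Counter(subtree_hashes(tree_a)[1])
    hashes_b = Counter(subtree_hashes(tree_b)[1])
    shared = sum((hashes_a & hashes_b).values())
    return 2.0 * shared / (sum(hashes_a.values()) + sum(hashes_b.values()))


if __name__ == "__main__":
    a = ("+", ("x",), ("*", ("y",), ("2",)))
    b = ("+", ("*", ("2",), ("y",)), ("x",))
    c = ("+", ("x",), ("z",))
    print("a and b isomorphic:", subtree_hashes(a)[0] == subtree_hashes(b)[0])
    print("similarity(a, c):", round(similarity(a, c), 3))
```

Each node is hashed exactly once, so the cost is linear in tree size apart from sorting child hashes. Equal hashes inside a single tree also point to repeated subexpressions, which is the kind of information algebraic simplification can exploit.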

AI · Feb 25, 2019

Resource-constrained multi-project scheduling with activity and time flexibility

Viktoria A. Hauder, Andreas Beham, Sebastian Raggl et al.

Project scheduling in manufacturing environments often requires flexibility in terms of the selection and the exact length of alternative production activities. Moreover, the simultaneous scheduling of multiple lots is mandatory in many production planning applications. To meet these requirements, a new resource-constrained project scheduling problem (RCPSP) is introduced where both decisions (activity flexibility and time flexibility) are integrated. Besides the minimization of makespan, two new alternative objectives are presented: maximization of balanced length of selected activities (time balance) and maximization of balanced resource utilization (resource balance). New mixed integer and constraint programming (CP) models are proposed for the developed integrated flexible project scheduling problem. Benchmark instances on an already existing flexible RCPSP and the newly developed problem are solved to optimality. The real-world applicability of the suggested CP models is shown by additionally solving a large industry case.

NE · Feb 3, 2019

Online Diversity Control in Symbolic Regression via a Fast Hash-based Tree Similarity Measure

Bogdan Burlacu, Michael Affenzeller, Gabriel Kronberger et al.

Diversity represents an important aspect of genetic programming, being directly correlated with search performance. When considered at the genotype level, diversity often requires expensive tree distance measures which have a negative impact on the algorithm's runtime performance. In this work we introduce a fast, hash-based tree distance measure to massively speed up the calculation of population diversity during the algorithmic run. We combine this measure with the standard GA and the NSGA-II genetic algorithms to steer the search towards higher diversity. We validate the approach on a collection of benchmark problems for symbolic regression where our method consistently outperforms the standard GA as well as NSGA-II configurations with different secondary objectives.

NE · Sep 23, 2013

Data Mining using Unguided Symbolic Regression on a Blast Furnace Dataset

Michael Kommenda, Gabriel Kronberger, Christoph Feilmayr et al.

In this paper a data mining approach for variable selection and knowledge extraction from datasets is presented. The approach is based on unguided symbolic regression (every variable present in the dataset is treated as the target variable in multiple regression runs) and a novel variable relevance metric for genetic programming. The relevance of each input variable is calculated and a model approximating the target variable is created. The genetic programming configurations with different target variables are executed multiple times to reduce stochastic effects and the aggregated results are displayed as a variable interaction network. This interaction network highlights important system components and implicit relations between the variables. The whole approach is tested on a blast furnace dataset, because of the complexity of the blast furnace and the many interrelations between the variables. Finally the achieved results are discussed with respect to existing knowledge about the blast furnace process.

NE · Sep 23, 2013

On the Success Rate of Crossover Operators for Genetic Programming with Offspring Selection

Gabriel Kronberger, Stephan Winkler, Michael Affenzeller et al.

Genetic programming is a powerful heuristic search technique that is used in a number of real-world applications to solve, among others, regression, classification, and time-series forecasting problems. A lot of progress towards a theoretic description of genetic programming in the form of schema theorems has been made, but the internal dynamics and success factors of genetic programming are still not fully understood. In particular, the effects of different crossover operators in combination with offspring selection are largely unknown. This contribution sheds light on the ability of well-known GP crossover operators to create better offspring when applied to benchmark problems. We conclude that standard (sub-tree swapping) crossover is a good default choice in combination with offspring selection, and that GP with offspring selection and random selection of crossover operators can improve the performance of the algorithm in terms of best solution quality when no solution size constraints are applied.