LGSep 27, 2024Code
ARLBench: Flexible and Efficient Benchmarking for Hyperparameter Optimization in Reinforcement LearningJannis Becktepe, Julian Dierkes, Carolin Benjamins et al.
Hyperparameters are a critical factor in reliably training well-performing reinforcement learning (RL) agents. Unfortunately, developing and evaluating automated approaches for tuning such hyperparameters is both costly and time-consuming. As a result, such approaches are often only evaluated on a single domain or algorithm, making comparisons difficult and limiting insights into their generalizability. We propose ARLBench, a benchmark for hyperparameter optimization (HPO) in RL that allows comparisons of diverse HPO approaches while being highly efficient in evaluation. To enable research into HPO in RL, even in settings with low compute resources, we select a representative subset of HPO tasks spanning a variety of algorithm and environment combinations. This selection allows for generating a performance profile of an automated RL (AutoRL) method using only a fraction of the compute previously necessary, enabling a broader range of researchers to work on HPO in RL. With the extensive and large-scale dataset on hyperparameter landscapes that our selection is based on, ARLBench is an efficient, flexible, and future-oriented foundation for research on AutoRL. Both the benchmark and the dataset are available at https://github.com/automl/arlbench.
LGJul 17, 2023
Q(D)O-ES: Population-based Quality (Diversity) Optimisation for Post Hoc Ensemble Selection in AutoMLLennart Purucker, Lennart Schneider, Marie Anastacio et al.
Automated machine learning (AutoML) systems commonly ensemble models post hoc to improve predictive performance, typically via greedy ensemble selection (GES). However, we believe that GES may not always be optimal, as it performs a simple deterministic greedy search. In this work, we introduce two novel population-based ensemble selection methods, QO-ES and QDO-ES, and compare them to GES. While QO-ES optimises solely for predictive performance, QDO-ES also considers the diversity of ensembles within the population, maintaining a diverse set of well-performing ensembles during optimisation based on ideas of quality diversity optimisation. The methods are evaluated using 71 classification datasets from the AutoML benchmark, demonstrating that QO-ES and QDO-ES often outrank GES, albeit only statistically significant on validation data. Our results further suggest that diversity can be beneficial for post hoc ensembling but also increases the risk of overfitting.
NEMay 27
Improving Evaluation of Recombination-based Cartesian Genetic ProgrammingDuy Long Tran, Anja Jankovic, Marie Anastacio et al.
Cartesian Genetic Programming has traditionally been using mutation as its main and often sole genetic operator to drive evolutionary search. Despite advancements in recent years, recombinationbased approaches have long been avoided, due to apparent lack of performance gains. This study examines two recently suggested recombination-based operators, subgraph crossover and discrete phenotypic recombination on SRBench, a benchmarking platform for symbolic regression. Using the implementations provided in the TinyverseGP framework, we perform hyperparameter optimisation of the respective representations with these two operators. Our work demonstrates that hyperparameter optimisation can lead to improvements in performance for recombination-based Cartesian Genetic Programming.
LGDec 4, 2025
GraphBench: Next-generation graph learning benchmarkingTimo Stoll, Chendi Qian, Ben Finkelshtein et al.
Machine learning on graphs has recently achieved impressive progress in various domains, including molecular property prediction and chip design. However, benchmarking practices remain fragmented, often relying on narrow, task-specific datasets and inconsistent evaluation protocols, which hampers reproducibility and broader progress. To address this, we introduce GraphBench, a comprehensive benchmarking suite that spans diverse domains and prediction tasks, including node-level, edge-level, graph-level, and generative settings. GraphBench provides standardized evaluation protocols -- with consistent dataset splits and performance metrics that account for out-of-distribution generalization -- as well as a unified hyperparameter tuning framework. Additionally, we benchmark GraphBench using message-passing neural networks and graph transformer models, providing principled baselines and establishing a reference performance. See www.graphbench.io for further details.
CVOct 22, 2025
Mitigating representation bias caused by missing pixels in methane plume detectionJulia Wąsala, Joannes D. Maasakkers, Ilse Aben et al.
Most satellite images have systematically missing pixels (i.e., missing data not at random (MNAR)) due to factors such as clouds. If not addressed, these missing pixels can lead to representation bias in automated feature extraction models. In this work, we show that spurious association between the label and the number of missing values in methane plume detection can cause the model to associate the coverage (i.e., the percentage of valid pixels in an image) with the label, subsequently under-detecting plumes in low-coverage images. We evaluate multiple imputation approaches to remove the dependence between the coverage and a label. Additionally, we propose a weighted resampling scheme during training that removes the association between the label and the coverage by enforcing class balance in each coverage bin. Our results show that both resampling and imputation can significantly reduce the representation bias without hurting balanced accuracy, precision, or recall. Finally, we evaluate the capability of the debiased models using these techniques in an operational scenario and demonstrate that the debiased models have a higher chance of detecting plumes in low-coverage images.
LGMay 21, 2025
Guidelines for the Quality Assessment of Energy-Aware NAS BenchmarksNick Kocher, Christian Wassermann, Leona Hennig et al.
Neural Architecture Search (NAS) accelerates progress in deep learning through systematic refinement of model architectures. The downside is increasingly large energy consumption during the search process. Surrogate-based benchmarking mitigates the cost of full training by querying a pre-trained surrogate to obtain an estimate for the quality of the model. Specifically, energy-aware benchmarking aims to make it possible for NAS to favourably trade off model energy consumption against accuracy. Towards this end, we propose three design principles for such energy-aware benchmarks: (i) reliable power measurements, (ii) a wide range of GPU usage, and (iii) holistic cost reporting. We analyse EA-HAS-Bench based on these principles and find that the choice of GPU measurement API has a large impact on the quality of results. Using the Nvidia System Management Interface (SMI) on top of its underlying library influences the sampling rate during the initial data collection, returning faulty low-power estimations. This results in poor correlation with accurate measurements obtained from an external power meter. With this study, we bring to attention several key considerations when performing energy-aware surrogate-based benchmarking and derive first guidelines that can help design novel benchmarks. We show a narrow usage range of the four GPUs attached to our device, ranging from 146 W to 305 W in a single-GPU setting, and narrowing down even further when using all four GPUs. To improve holistic energy reporting, we propose calibration experiments over assumptions made in popular tools, such as Code Carbon, thus achieving reductions in the maximum inaccuracy from 10.3 % to 8.9 % without and to 6.6 % with prior estimation of the expected load on the device.
NEApr 14, 2025
TinyverseGP: Towards a Modular Cross-domain Benchmarking Framework for Genetic ProgrammingRoman Kalkreuth, Fabricio Olivetti de França, Julian Dierkes et al.
Over the years, genetic programming (GP) has evolved, with many proposed variations, especially in how they represent a solution. Being essentially a program synthesis algorithm, it is capable of tackling multiple problem domains. Current benchmarking initiatives are fragmented, as the different representations are not compared with each other and their performance is not measured across the different domains. In this work, we propose a unified framework, dubbed TinyverseGP (inspired by tinyGP), which provides support to multiple representations and problem domains, including symbolic regression, logic synthesis and policy search.
SEMar 1, 2021
Practices for Engineering Trustworthy Machine Learning ApplicationsAlex Serban, Koen van der Blom, Holger Hoos et al.
Following the recent surge in adoption of machine learning (ML), the negative impact that improper use of ML can have on users and society is now also widely recognised. To address this issue, policy makers and other stakeholders, such as the European Commission or NIST, have proposed high-level guidelines aiming to promote trustworthy ML (i.e., lawful, ethical and robust). However, these guidelines do not specify actions to be taken by those involved in building ML systems. In this paper, we argue that guidelines related to the development of trustworthy ML can be translated to operational practices, and should become part of the ML development life cycle. Towards this goal, we ran a multi-vocal literature review, and mined operational practices from white and grey literature. Moreover, we launched a global survey to measure practice adoption and the effects of these practices. In total, we identified 14 new practices, and used them to complement an existing catalogue of ML engineering practices. Initial analysis of the survey results reveals that so far, practice adoption for trustworthy ML is relatively low. In particular, practices related to assuring security of ML components have very low adoption. Other practices enjoy slightly larger adoption, such as providing explanations to users. Our extended practice catalogue can be used by ML development teams to bridge the gap between high-level guidelines and actual development of trustworthy ML systems; it is open for review and contribution
SEJul 28, 2020
Adoption and Effects of Software Engineering Best Practices in Machine LearningAlex Serban, Koen van der Blom, Holger Hoos et al.
The increasing reliance on applications with machine learning (ML) components calls for mature engineering techniques that ensure these are built in a robust and future-proof manner. We aim to empirically determine the state of the art in how teams develop, deploy and maintain software with ML components. We mined both academic and grey literature and identified 29 engineering best practices for ML applications. We conducted a survey among 313 practitioners to determine the degree of adoption for these practices and to validate their perceived effects. Using the survey responses, we quantified practice adoption, differentiated along demographic characteristics, such as geography or team size. We also tested correlations and investigated linear and non-linear relationships between practices and their perceived effect using various statistical models. Our findings indicate, for example, that larger teams tend to adopt more practices, and that traditional software engineering practices tend to have lower adoption than ML specific practices. Also, the statistical models can accurately predict perceived effects such as agility, software quality and traceability, from the degree of adoption for specific sets of practices. Combining practice adoption rates with practice importance, as revealed by statistical models, we identify practices that are important but have low adoption, as well as practices that are widely adopted but are less important for the effects we studied. Overall, our survey and the analysis of responses received provide a quantitative basis for assessment and step-wise improvement of practice adoption by ML teams.
AIJun 8, 2015
ASlib: A Benchmark Library for Algorithm SelectionBernd Bischl, Pascal Kerschke, Lars Kotthoff et al.
The task of algorithm selection involves choosing an algorithm from a set of algorithms on a per-instance basis in order to exploit the varying performance of algorithms over a set of instances. The algorithm selection problem is attracting increasing attention from researchers and practitioners in AI. Years of fruitful applications in a number of domains have resulted in a large amount of data, but the community lacks a standard format or repository for this data. This situation makes it difficult to share and compare different approaches effectively, as is done in other, more established fields. It also unnecessarily hinders new researchers who want to work in this area. To address this problem, we introduce a standardized format for representing algorithm selection scenarios and a repository that contains a growing number of data sets from the literature. Our format has been designed to be able to express a wide variety of different scenarios. Demonstrating the breadth and power of our platform, we describe a set of example experiments that build and evaluate algorithm selection models through a common interface. The results display the potential of algorithm selection to achieve significant performance improvements across a broad range of problems and algorithms.
AIMay 5, 2015
The Configurable SAT Solver Challenge (CSSC)Frank Hutter, Marius Lindauer, Adrian Balint et al.
It is well known that different solution strategies work well for different types of instances of hard combinatorial problems. As a consequence, most solvers for the propositional satisfiability problem (SAT) expose parameters that allow them to be customized to a particular family of instances. In the international SAT competition series, these parameters are ignored: solvers are run using a single default parameter setting (supplied by the authors) for all benchmark instances in a given track. While this competition format rewards solvers with robust default settings, it does not reflect the situation faced by a practitioner who only cares about performance on one particular application and can invest some time into tuning solver parameters for this application. The new Configurable SAT Solver Competition (CSSC) compares solvers in this latter setting, scoring each solver by the performance it achieved after a fully automated configuration step. This article describes the CSSC in more detail, and reports the results obtained in its two instantiations so far, CSSC 2013 and 2014.
AIMay 7, 2014
claspfolio 2: Advances in Algorithm Selection for Answer Set ProgrammingHolger Hoos, Marius Lindauer, Torsten Schaub
To appear in Theory and Practice of Logic Programming (TPLP). Building on the award-winning, portfolio-based ASP solver claspfolio, we present claspfolio 2, a modular and open solver architecture that integrates several different portfolio-based algorithm selection approaches and techniques. The claspfolio 2 solver framework supports various feature generators, solver selection approaches, solver portfolios, as well as solver-schedule-based pre-solving techniques. The default configuration of claspfolio 2 relies on a light-weight version of the ASP solver clasp to generate static and dynamic instance features. The flexible open design of claspfolio 2 is a distinguishing factor even beyond ASP. As such, it provides a unique framework for comparing and combining existing portfolio-based algorithm selection approaches and techniques in a single, unified framework. Taking advantage of this, we conducted an extensive experimental study to assess the impact of different feature sets, selection approaches and base solver portfolios. In addition to gaining substantial insights into the utility of the various approaches and techniques, we identified a default configuration of claspfolio 2 that achieves substantial performance gains not only over clasp's default configuration and the earlier version of claspfolio 2, but also over manually tuned configurations of clasp.
AIJan 6, 2014
Solver Scheduling via Answer Set ProgrammingHolger Hoos, Roland Kaminski, Marius Lindauer et al.
Although Boolean Constraint Technology has made tremendous progress over the last decade, the efficacy of state-of-the-art solvers is known to vary considerably across different types of problem instances and is known to depend strongly on algorithm parameters. This problem was addressed by means of a simple, yet effective approach using handmade, uniform and unordered schedules of multiple solvers in ppfolio, which showed very impressive performance in the 2011 SAT Competition. Inspired by this, we take advantage of the modeling and solving capacities of Answer Set Programming (ASP) to automatically determine more refined, that is, non-uniform and ordered solver schedules from existing benchmarking data. We begin by formulating the determination of such schedules as multi-criteria optimization problems and provide corresponding ASP encodings. The resulting encodings are easily customizable for different settings and the computation of optimum schedules can mostly be done in the blink of an eye, even when dealing with large runtime data sets stemming from many solvers on hundreds to thousands of instances. Also, the fact that our approach can be customized easily enabled us to swiftly adapt it to generate parallel schedules for multi-processor machines.
AIOct 7, 2013
Bayesian Optimization With Censored Response DataFrank Hutter, Holger Hoos, Kevin Leyton-Brown
Bayesian optimization (BO) aims to minimize a given blackbox function using a model that is updated whenever new evidence about the function becomes available. Here, we address the problem of BO under partially right-censored response data, where in some evaluations we only obtain a lower bound on the function value. The ability to handle such response data allows us to adaptively censor costly function evaluations in minimization problems where the cost of a function evaluation corresponds to the function value. One important application giving rise to such censored data is the runtime-minimizing variant of the algorithm configuration problem: finding settings of a given parametric algorithm that minimize the runtime required for solving problem instances from a given distribution. We demonstrate that terminating slow algorithm runs prematurely and handling the resulting right-censored observations can substantially improve the state of the art in model-based algorithm configuration.