Ludwig Bothmann

h-index: 48

16 papers

118 citations

Novelty: 38%

AI Score: 41

Ranked #92,297 of 201,018 authors (top 46%); #20,699 in LG (top 49%)

16 Papers

CV · Mar 28, 2023

Automated wildlife image classification: An active learning tool for ecological applications

Ludwig Bothmann, Lisa Wimmer, Omid Charrakh et al.

Wildlife camera trap images are being used extensively to investigate animal abundance, habitat associations, and behavior, which is complicated by the fact that experts must first classify the images manually. Artificial intelligence systems can take over this task but usually need a large number of already-labeled training images to achieve sufficient performance. This requirement necessitates human expert labor and poses a particular challenge for projects with few cameras or short durations. We propose a label-efficient learning strategy that enables researchers with small or medium-sized image databases to leverage the potential of modern machine learning, thus freeing crucial resources for subsequent analyses. Our methodological proposal is two-fold: (1) We improve current strategies of combining object detection and image classification by tuning the hyperparameters of both models. (2) We provide an active learning (AL) system that allows training deep learning models very efficiently in terms of required human-labeled training images. We supply a software package that enables researchers to use these methods directly and thereby ensure the broad applicability of the proposed framework in ecological practice. We show that our tuning strategy improves predictive performance. We demonstrate how the AL pipeline reduces the amount of pre-labeled data needed to achieve a specific predictive performance and that it is especially valuable for improving out-of-sample predictive performance. We conclude that the combination of tuning and AL increases predictive performance substantially. Furthermore, we argue that our work can broadly impact the community through the ready-to-use software package provided. Finally, the publication of our models tailored to European wildlife data enriches existing model bases mostly trained on data from Africa and North America.

ML · Oct 23, 2023

Evaluating machine learning models in non-standard settings: An overview and new findings

Roman Hornung, Malte Nalenz, Lennart Schneider et al.

Estimating the generalization error (GE) of machine learning models is fundamental, with resampling methods being the most common approach. However, in non-standard settings, particularly those where observations are not independently and identically distributed, resampling using simple random data divisions may lead to biased GE estimates. This paper strives to present well-grounded guidelines for GE estimation in various such non-standard settings: clustered data, spatial data, unequal sampling probabilities, concept drift, and hierarchically structured outcomes. Our overview combines well-established methodologies with other existing methods that, to our knowledge, have not been frequently considered in these particular settings. A unifying principle among these techniques is that the test data used in each iteration of the resampling procedure should reflect the new observations to which the model will be applied, while the training data should be representative of the entire data set used to obtain the final model. Beyond providing an overview, we address literature gaps by conducting simulation studies. These studies assess the necessity of using GE-estimation methods tailored to the respective setting. Our findings corroborate the concern that standard resampling methods often yield biased GE estimates in non-standard settings, underscoring the importance of tailored GE estimation.
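
For clustered data, one reading of the unifying principle in this abstract is that whole clusters should be held out together, so that the test fold resembles genuinely new clusters. The following is a minimal sketch of such a grouped split, not the procedure from the paper; the function name `grouped_k_fold`, the greedy balancing rule, and the toy `groups` labels are assumptions made for illustration.

```python
from collections import defaultdict


def grouped_k_fold(groups, k):
    """Yield (train_idx, test_idx) pairs in which no cluster spans both sides."""
    # Collect the row indices that belong to each cluster label.
    members = defaultdict(list)
    for idx, g in enumerate(groups):
        members[g].append(idx)

    # Assign whole clusters to folds, largest first, always filling the
    # currently smallest fold so that fold sizes stay roughly balanced.
    folds = [[] for _ in range(k)]
    for g in sorted(members, key=lambda g: (-len(members[g]), str(g))):
        smallest = min(range(k), key=lambda f: len(folds[f]))
        folds[smallest].extend(members[g])

    # Each fold serves once as test data; all other folds form the training data.
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for j in range(k) if j != f for i in folds[j])
        yield train, test


# Hypothetical cluster labels, e.g. one label per patient or site.
groups = ["a", "a", "b", "b", "b", "c", "d", "d"]
splits = list(grouped_k_fold(groups, k=2))
for train, test in splits:
    print("train:", train, "test:", test)
```

A simple random split of the same rows could place observations from cluster "b" in both training and test data, which is the kind of leakage that the abstract associates with biased GE estimates.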

LG · Jul 24, 2023

Causal Fair Machine Learning via Rank-Preserving Interventional Distributions

Ludwig Bothmann, Susanne Dandl, Michael Schomaker

A decision can be defined as fair if equal individuals are treated equally and unequals unequally. Adopting this definition, the task of designing machine learning (ML) models that mitigate unfairness in automated decision-making systems must include causal thinking when introducing protected attributes: Following a recent proposal, we define individuals as being normatively equal if they are equal in a fictitious, normatively desired (FiND) world, where the protected attributes have no (direct or indirect) causal effect on the target. We propose rank-preserving interventional distributions to define a specific FiND world in which this holds and a warping method for estimation. Evaluation criteria for both the method and the resulting ML model are presented and validated through simulations. Experiments on empirical data showcase the practical application of our method and compare results with "fairadapt" (Plečko and Meinshausen, 2020), a different approach for mitigating unfairness by causally preprocessing data that uses quantile regression forests. With this, we show that our warping approach effectively identifies the most discriminated individuals and mitigates unfairness.

LG · May 19, 2022

What Is Fairness? On the Role of Protected Attributes and Fictitious Worlds

Ludwig Bothmann, Kristina Peters, Bernd Bischl

A growing body of literature in fairness-aware machine learning (fairML) aims to mitigate machine learning (ML)-related unfairness in automated decision-making (ADM) by defining metrics that measure fairness of an ML model and by proposing methods to ensure that trained ML models achieve low scores on these metrics. However, the underlying concept of fairness, i.e., the question of what fairness is, is rarely discussed, leaving a significant gap between centuries of philosophical discussion and the recent adoption of the concept in the ML community. In this work, we try to bridge this gap by formalizing a consistent concept of fairness and by translating the philosophical considerations into a formal framework for the training and evaluation of ML models in ADM systems. We argue that fairness problems can arise even without the presence of protected attributes (PAs), and point out that fairness and predictive performance are not irreconcilable opposites, but that the latter is necessary to achieve the former. Furthermore, we argue why and how causal considerations are necessary when assessing fairness in the presence of PAs by proposing a fictitious, normatively desired (FiND) world in which PAs have no causal effects. In practice, this FiND world must be approximated by a warped world in which the causal effects of the PAs are removed from the real-world data. Finally, we achieve greater linguistic clarity in the discussion of fairML. We outline algorithms for practical applications and present illustrative experiments on COMPAS data.

CV · Jan 22

Beyond Off-the-Shelf Models: A Lightweight and Accessible Machine Learning Pipeline for Ecologists Working with Image Data

Clare Chemery, Hendrik Edelhoff, Ludwig Bothmann

We introduce a lightweight experimentation pipeline designed to lower the barrier for applying machine learning (ML) methods for classifying images in ecological research. We enable ecologists to experiment with ML models independently, so that they can move beyond off-the-shelf models and generate insights tailored to local datasets, specific classification tasks, and target variables. Our tool combines a simple command-line interface for preprocessing, training, and evaluation with a graphical interface for annotation, error analysis, and model comparison. This design enables ecologists to build and iterate on compact, task-specific classifiers without requiring advanced ML expertise. As a proof of concept, we apply the pipeline to classify red deer (Cervus elaphus) by age and sex from 3392 camera trap images collected in the Veldenstein Forest, Germany. Using 4352 cropped images containing individual deer labeled by experts, we trained and evaluated multiple backbone architectures with a wide variety of parameters and data augmentation strategies. Our best-performing models achieved 90.77% accuracy for age classification and 96.15% for sex classification. These results demonstrate that reliable demographic classification is feasible even with limited data to answer narrow, well-defined ecological problems. More broadly, the framework provides ecologists with an accessible tool for developing ML models tailored to specific research questions, paving the way for broader adoption of ML in wildlife monitoring and demographic analysis.

CY · Jul 28, 2021

Developing Open Source Educational Resources for Machine Learning and Data Science

Ludwig Bothmann, Sven Strickroth, Giuseppe Casalicchio et al.

Education should not be a privilege but a common good. It should be openly accessible to everyone, with as few barriers as possible; even more so for key technologies such as Machine Learning (ML) and Data Science (DS). Open Educational Resources (OER) are a crucial factor for greater educational equity. In this paper, we describe the specific requirements for OER in ML and DS and argue that it is especially important for these fields to make source files publicly available, leading to Open Source Educational Resources (OSER). We present our view on the collaborative development of OSER, the challenges this poses, and first steps towards their solutions. We outline how OSER can be used for blended learning scenarios and share our experiences in university education. Finally, we discuss additional challenges such as credit assignment or granting certificates.

ML · Apr 19, 2024

A Guide to Feature Importance Methods for Scientific Inference

Fiona Katharina Ewald, Ludwig Bothmann, Marvin N. Wright et al.

While machine learning (ML) models are increasingly used due to their high predictive power, their use in understanding the data-generating process (DGP) is limited. Understanding the DGP requires insights into feature-target associations, which many ML models cannot directly provide due to their opaque internal mechanisms. Feature importance (FI) methods provide useful insights into the DGP under certain conditions. Since the results of different FI methods have different interpretations, selecting the correct FI method for a concrete use case is crucial and still requires expert knowledge. This paper serves as a comprehensive guide to help understand the different interpretations of global FI methods. Through an extensive review of FI methods and providing new proofs regarding their interpretation, we facilitate a thorough understanding of these methods and formulate concrete recommendations for scientific inference. We conclude by discussing options for FI uncertainty estimation and point to directions for future research aiming at full statistical inference from black-box ML models.

ML · Jan 24, 2025

Overcoming Fairness Trade-offs via Pre-processing: A Causal Perspective

Charlotte Leininger, Simon Rittel, Ludwig Bothmann

Training machine learning models for fair decisions faces two key challenges. The fairness-accuracy trade-off arises because enforcing fairness weakens a model's predictive performance relative to an unconstrained model. The incompatibility of different fairness metrics poses another trade-off, also known as the impossibility theorem. Recent work identifies the bias within the observed data as a possible root cause and shows that fairness and predictive performance are in fact in accord when predictive performance is measured on unbiased data. We offer a causal explanation for these findings using the framework of the FiND (fictitious and normatively desired) world, a "fair" world, where protected attributes have no causal effects on the target variable. We show theoretically that (i) classical fairness metrics deemed to be incompatible are naturally satisfied in the FiND world, while (ii) fairness aligns with high predictive performance. We extend our analysis by suggesting how one can benefit from these theoretical insights in practice, using causal pre-processing methods that approximate the FiND world. Additionally, we propose a method for evaluating the approximation of the FiND world via pre-processing in practical use cases where we do not have access to the FiND world. In simulations and empirical studies, we demonstrate that these pre-processing methods are successful in approximating the FiND world and resolve both trade-offs. Our results provide actionable solutions for practitioners to achieve fairness and high predictive performance simultaneously.

LG · Feb 13, 2025

Trust Me, I Know the Way: Predictive Uncertainty in the Presence of Shortcut Learning

Lisa Wimmer, Bernd Bischl, Ludwig Bothmann

The correct way to quantify predictive uncertainty in neural networks remains a topic of active discussion. In particular, it is unclear whether the state-of-the-art entropy decomposition leads to a meaningful representation of model, or epistemic, uncertainty (EU) in light of a debate that pits ignorance against disagreement perspectives. We aim to reconcile the conflicting viewpoints by arguing that both are valid but arise from different learning situations. Notably, we show that the presence of shortcuts is decisive for EU manifesting as disagreement.
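
The abstract does not spell out the entropy decomposition it refers to. A commonly used form, assumed here, splits the entropy of the averaged prediction (total uncertainty) into the average entropy of the individual predictions (aleatoric part) and the remainder (epistemic part, a mutual information). The sketch below illustrates that form on two toy ensembles; the function names and the example probabilities are hypothetical and not taken from the paper.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0)


def decompose_uncertainty(member_probs):
    """Split predictive uncertainty for one input.

    member_probs: one class-probability vector per ensemble member
    (or posterior sample). Returns (total, aleatoric, epistemic).
    """
    n = len(member_probs)
    n_classes = len(member_probs[0])
    # Averaged predictive distribution over all members.
    mean = [sum(p[c] for p in member_probs) / n for c in range(n_classes)]
    total = entropy(mean)                                  # entropy of the mean
    aleatoric = sum(entropy(p) for p in member_probs) / n  # mean of the entropies
    epistemic = total - aleatoric                          # mutual information
    return total, aleatoric, epistemic


# Members agree that the outcome is a coin flip: no disagreement.
agree = [[0.5, 0.5], [0.5, 0.5]]
# Members are individually certain but contradict each other.
disagree = [[1.0, 0.0], [0.0, 1.0]]

print(decompose_uncertainty(agree))
print(decompose_uncertainty(disagree))
```

Both toy ensembles have the same total uncertainty, but only the second attributes it to the epistemic term, which is the sense in which this decomposition measures EU as disagreement between members.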

LG · Sep 24, 2025

TABFAIRGDT: A Fast Fair Tabular Data Generator using Autoregressive Decision Trees

Emmanouil Panagiotou, Benoît Ronval, Arjun Roy et al.

Ensuring fairness in machine learning remains a significant challenge, as models often inherit biases from their training data. Generative models have recently emerged as a promising approach to mitigate bias at the data level while preserving utility. However, many rely on deep architectures, despite evidence that simpler models can be highly effective for tabular data. In this work, we introduce TABFAIRGDT, a novel method for generating fair synthetic tabular data using autoregressive decision trees. To enforce fairness, we propose a soft leaf resampling technique that adjusts decision tree outputs to reduce bias while preserving predictive performance. Our approach is non-parametric, effectively capturing complex relationships between mixed feature types, without relying on assumptions about the underlying data distributions. We evaluate TABFAIRGDT on benchmark fairness datasets and demonstrate that it outperforms state-of-the-art (SOTA) deep generative models, achieving a better fairness-utility trade-off for downstream tasks, as well as higher synthetic data quality. Moreover, our method is lightweight, highly efficient, and CPU-compatible, requiring no data pre-processing. Remarkably, TABFAIRGDT achieves a 72% average speedup over the fastest SOTA baseline across various dataset sizes, and can generate fair synthetic data for medium-sized datasets (10 features, 10K samples) in just one second on a standard CPU, making it an ideal solution for real-world fairness-sensitive applications.

IR · Aug 5, 2025

Are All Genders Equal in the Eyes of Algorithms? -- Analysing Search and Retrieval Algorithms for Algorithmic Gender Fairness

Stefanie Urchs, Veronika Thurner, Matthias Aßenmacher et al.

Algorithmic systems such as search engines and information retrieval platforms significantly influence academic visibility and the dissemination of knowledge. Despite assumptions of neutrality, these systems can reproduce or reinforce societal biases, including those related to gender. This paper introduces and applies a bias-preserving definition of algorithmic gender fairness, which assesses whether algorithmic outputs reflect real-world gender distributions without introducing or amplifying disparities. Using a heterogeneous dataset of academic profiles from German universities and universities of applied sciences, we analyse gender differences in metadata completeness, publication retrieval in academic databases, and visibility in Google search results. While we observe no overt algorithmic discrimination, our findings reveal subtle but consistent imbalances: male professors are associated with a greater number of search results and more aligned publication records, while female professors display higher variability in digital visibility. These patterns reflect the interplay between platform algorithms, institutional curation, and individual self-presentation. Our study highlights the need for fairness evaluations that account for both technical performance and representational equality in digital systems.

LG · Feb 26, 2025

Invariance Pair-Guided Learning: Enhancing Robustness in Neural Networks

Martin Surner, Abdelmajid Khelil, Ludwig Bothmann

Out-of-distribution generalization of machine learning models remains challenging since the models are inherently bound to the training data distribution. This manifests especially when the learned models rely on spurious correlations. Most existing approaches apply data manipulation, representation learning, or learning strategies to achieve generalizable models. Unfortunately, these approaches usually require multiple training domains, group labels, specialized augmentation, or pre-processing to reach generalizable models. We propose a novel approach that addresses these limitations by providing a technique to guide the neural network through the training phase. We first establish input pairs, representing the spurious attribute and describing the invariance, a characteristic that should not affect the outcome of the model. Based on these pairs, we form a corrective gradient complementing the traditional gradient descent approach. We further make this correction mechanism adaptive based on a predefined invariance condition. Experiments on ColoredMNIST, Waterbird-100, and CelebA datasets demonstrate the effectiveness of our approach and the robustness to group shifts.

LG · Feb 3, 2025

Privilege Scores

Ludwig Bothmann, Philip A. Boustani, Jose M. Alvarez et al.

Bias-transforming methods of fairness-aware machine learning aim to correct a non-neutral status quo with respect to a protected attribute (PA). Current methods, however, lack an explicit formulation of what drives non-neutrality. We introduce privilege scores (PS) to measure PA-related privilege by comparing the model predictions in the real world with those in a fair world in which the influence of the PA is removed. At the individual level, PS can identify individuals who qualify for affirmative action; at the global level, PS can inform bias-transforming policies. After presenting estimation methods for PS, we propose privilege score contributions (PSCs), an interpretation method that attributes the origin of privilege to mediating features and direct effects. We provide confidence intervals for both PS and PSCs. Experiments on simulated and real-world data demonstrate the broad applicability of our methods and provide novel insights into gender and racial privilege in mortgage and college admissions applications.

LG · Apr 25, 2024

mlr3summary: Concise and interpretable summaries for machine learning models

Susanne Dandl, Marc Becker, Bernd Bischl et al.

This work introduces a novel R package for concise, informative summaries of machine learning models. We take inspiration from the summary function for (generalized) linear models in R, but extend it in several directions: first, our summary function is model-agnostic and provides a unified summary output also for non-parametric machine learning models; second, the summary output is more extensive and customizable, comprising information on the dataset, model performance, model complexity, the model's estimated feature importances, feature effects, and fairness metrics; third, models are evaluated based on resampling strategies for unbiased estimates of model performances, feature importances, etc. Overall, the clear, structured output should help to enhance and expedite the model selection process, making it a helpful tool for practitioners and researchers alike.

LG · Feb 2, 2024

Connecting the Dots: Is Mode-Connectedness the Key to Feasible Sample-Based Inference in Bayesian Neural Networks?

Emanuel Sommer, Lisa Wimmer, Theodore Papamarkou et al.

A major challenge in sample-based inference (SBI) for Bayesian neural networks is the size and structure of the networks' parameter space. Our work shows that successful SBI is possible by embracing the characteristic relationship between weight and function space, uncovering a systematic link between overparameterization and the difficulty of the sampling problem. Through extensive experiments, we establish practical guidelines for sampling and convergence diagnosis. As a result, we present a deep ensemble initialized approach as an effective solution with competitive performance and uncertainty quantification.

ML · May 4, 2023

Interpretable Regional Descriptors: Hyperbox-Based Local Explanations

Susanne Dandl, Giuseppe Casalicchio, Bernd Bischl et al.

This work introduces interpretable regional descriptors, or IRDs, for local, model-agnostic interpretations. IRDs are hyperboxes that describe how an observation's feature values can be changed without affecting its prediction. They justify a prediction by providing a set of "even if" arguments (semi-factual explanations), and they indicate which features affect a prediction and whether pointwise biases or implausibilities exist. A concrete use case shows that this is valuable for both machine learning modelers and persons subject to a decision. We formalize the search for IRDs as an optimization problem and introduce a unifying framework for computing IRDs that covers desiderata, initialization techniques, and a post-processing method. We show how existing hyperbox methods can be adapted to fit into this unified framework. A benchmark study compares the methods based on several quality measures and identifies two strategies to improve IRDs.