LGFeb 10Code
MacrOData: New Benchmarks of Thousands of Datasets for Tabular Outlier DetectionXueying Ding, Simon Klüttermann, Haomin Wen et al.
Quality benchmarks are essential for fairly and accurately tracking scientific progress and enabling practitioners to make informed methodological choices. Outlier detection (OD) on tabular data underpins numerous real-world applications, yet existing OD benchmarks remain limited. The prominent OD benchmark AdBench is the de facto standard in the literature, yet comprises only 57 datasets. In addition to other shortcomings discussed in this work, its small scale severely restricts diversity and statistical power. We introduce MacrOData, a large-scale benchmark suite for tabular OD comprising three carefully curated components: OddBench, with 790 datasets containing real-world semantic anomalies; OvrBench, with 856 datasets featuring real-world statistical outliers; and SynBench, with 800 synthetically generated datasets spanning diverse data priors and outlier archetypes. Owing to its scale and diversity, MacrOData enables comprehensive and statistically robust evaluation of tabular OD methods. Our benchmarks further satisfy several key desiderata: We provide standardized train/test splits for all datasets, public/private benchmark partitions with held-out test labels for the latter reserved toward an online leaderboard, and annotate our datasets with semantic metadata. We conduct extensive experiments across all benchmarks, evaluating a broad range of OD methods comprising classical, deep, and foundation models, over diverse hyperparameter configurations. We report detailed empirical findings, practical guidelines, as well as individual performances as references for future research. All benchmarks containing 2,446 datasets combined are open-sourced, along with a publicly accessible leaderboard hosted at https://huggingface.co/MacrOData-CMU.
LGSep 24, 2024
Exploring the Impact of Outlier Variability on Anomaly Detection Evaluation MetricsMinjae Ok, Simon Klüttermann, Emmanuel Müller
Anomaly detection is a dynamic field, in which the evaluation of models plays a critical role in understanding their effectiveness. The selection and interpretation of the evaluation metrics are pivotal, particularly in scenarios with varying amounts of anomalies. This study focuses on examining the behaviors of three widely used anomaly detection metrics under different conditions: F1 score, Receiver Operating Characteristic Area Under Curve (ROC AUC), and Precision-Recall Curve Area Under Curve (AUCPR). Our study critically analyzes the extent to which these metrics provide reliable and distinct insights into model performance, especially considering varying levels of outlier fractions and contamination thresholds in datasets. Through a comprehensive experimental setup involving widely recognized algorithms for anomaly detection, we present findings that challenge the conventional understanding of these metrics and reveal nuanced behaviors under varying conditions. We demonstrated that while the F1 score and AUCPR are sensitive to outlier fractions, the ROC AUC maintains consistency and is unaffected by such variability. Additionally, under conditions of a fixed outlier fraction in the test set, we observe an alignment between ROC AUC and AUCPR, indicating that the choice between these two metrics may be less critical in such scenarios. The results of our study contribute to a more refined understanding of metric selection and interpretation in anomaly detection, offering valuable insights for both researchers and practitioners in the field.
LGApr 4, 2024
Deep Transductive Outlier DetectionSimon Klüttermann, Emmanuel Müller
Outlier detection (OD) is one of the core challenges in machine learning. Transductive learning, which leverages test data during training, has shown promise in related machine learning tasks, yet remains largely unexplored for modern OD. We present Doust, the first end-to-end transductive deep learning algorithm for outlier detection, which explicitly leverages unlabeled test data to boost accuracy. On the comprehensive ADBench benchmark, Doust achieves an average ROC-AUC of $89%$, outperforming all 21 competitors by roughly $10%$. Our analysis identifies both the potential and a limitation of transductive OD: while performance gains can be substantial in favorable conditions, very low contamination rates can hinder improvements unless the dataset is sufficiently large.
LGAug 13, 2025
Rare anomalies require large datasets: About proving the existence of anomaliesSimon Klüttermann, Emmanuel Müller
Detecting whether any anomalies exist within a dataset is crucial for effective anomaly detection, yet it remains surprisingly underexplored in anomaly detection literature. This paper presents a comprehensive study that addresses the fundamental question: When can we conclusively determine that anomalies are present? Through extensive experimentation involving over three million statistical tests across various anomaly detection tasks and algorithms, we identify a relationship between the dataset size, contamination rate, and an algorithm-dependent constant $ α_{\text{algo}} $. Our results demonstrate that, for an unlabeled dataset of size $ N $ and contamination rate $ ν$, the condition $ N \ge \frac{α_{\text{algo}}}{ν^2} $ represents a lower bound on the number of samples required to confirm anomaly existence. This threshold implies a limit to how rare anomalies can be before proving their existence becomes infeasible.
LGApr 29, 2025
Unsupervised Surrogate Anomaly DetectionSimon Klüttermann, Tim Katzke, Emmanuel Müller
In this paper, we study unsupervised anomaly detection algorithms that learn a neural network representation, i.e. regular patterns of normal data, which anomalies are deviating from. Inspired by a similar concept in engineering, we refer to our methodology as surrogate anomaly detection. We formalize the concept of surrogate anomaly detection into a set of axioms required for optimal surrogate models and propose a new algorithm, named DEAN (Deep Ensemble ANomaly detection), designed to fulfill these criteria. We evaluate DEAN on 121 benchmark datasets, demonstrating its competitive performance against 19 existing methods, as well as the scalability and reliability of our method.
LGJul 21, 2025
We Need to Rethink Benchmarking in Anomaly DetectionPhilipp Röchner, Simon Klüttermann, Franz Rothlauf et al.
Despite the continuous proposal of new anomaly detection algorithms and extensive benchmarking efforts, progress seems to stagnate, with only minor performance differences between established baselines and new algorithms. In this position paper, we argue that this stagnation is due to limitations in how we evaluate anomaly detection algorithms. Current benchmarking does not, for example, sufficiently reflect the diversity of anomalies in applications ranging from predictive maintenance to scientific discovery. Consequently, we need to rethink benchmarking in anomaly detection. In our opinion, anomaly detection should be studied using scenarios that capture the relevant characteristics of different applications. We identify three key areas for improvement: First, we need to identify anomaly detection scenarios based on a common taxonomy. Second, anomaly detection pipelines should be analyzed end-to-end and by component. Third, evaluating anomaly detection algorithms should be meaningful regarding the scenario's objectives.
LGJun 16, 2025
Polyra Swarms: A Shape-Based Approach to Machine LearningSimon Klüttermann, Emmanuel Müller
We propose Polyra Swarms, a novel machine-learning approach that approximates shapes instead of functions. Our method enables general-purpose learning with very low bias. In particular, we show that depending on the task, Polyra Swarms can be preferable compared to neural networks, especially for tasks like anomaly detection. We further introduce an automated abstraction mechanism that simplifies the complexity of a Polyra Swarm significantly, enhancing both their generalization and transparency. Since Polyra Swarms operate on fundamentally different principles than neural networks, they open up new research directions with distinct strengths and limitations.
LGMar 19, 2024
On the Effectiveness of Heterogeneous Ensemble Methods for Re-identificationSimon Klüttermann, Jérôme Rutinowski, Anh Nguyen et al.
In this contribution, we introduce a novel ensemble method for the re-identification of industrial entities, using images of chipwood pallets and galvanized metal plates as dataset examples. Our algorithms replace commonly used, complex siamese neural networks with an ensemble of simplified, rudimentary models, providing wider applicability, especially in hardware-restricted scenarios. Each ensemble sub-model uses different types of extracted features of the given data as its input, allowing for the creation of effective ensembles in a fraction of the training duration needed for more complex state-of-the-art models. We reach state-of-the-art performance at our task, with a Rank-1 accuracy of over 77% and a Rank-10 accuracy of over 99%, and introduce five distinct feature extraction approaches, and study their combination using different ensemble methods.