HCMay 18
Exploring Trust Calibration in XAI - The Impact of Exposing Model Limitations to Lay UsersAlfio Ventura, Tim Katzke, Jan Corazza et al.
Trust calibration -- aligning user trust judgment with model capability -- is crucial for safe deployment of explainable AI (XAI), yet is often evaluated via global trust ratings detached from objective performance evidence. We present a preregistered, incentivized between-subject online study (N=418 representative UK sample) on explainable skin-lesion classification that disentangles expectation-setting from experienced performance. Participants completed 15 case evaluations using a fixed XAI panel (malignancy score, reliability score, and saliency map). We systematically manipulated five experimental onboarding conditions varying example-based information and limitation disclosures with five stimulus packages naturally varying observed prediction quality. Calibration was operationalized as the deviation between trust-related judgments (TAIS and case-wise ratings) and objective performance benchmarks for the encountered cases, analysed with hierarchical mixed-effects models. Only limitation disclosure for case-wise measures reliably impacts trust calibration, and short-term experience did not yield progressive calibration. Further, the experienced package of stimuli explained substantially more variance than the experimental manipulation. However, participants were hard-pressed to differentiate between case-wise perceived trust, trustworthiness, and accuracy estimation. We discuss implications for designing limitation communication and for measuring and analysing calibration metrics in XAI evaluations. All study materials and data of this study are publicly available for replication and further academic use.
LGMar 18
FoMo X: Modular Explainability Signals for Outlier Detection Foundation ModelsSimon Klüttermann, Tim Katzke, Phuong Huong Nguyen et al.
Tabular foundation models, specifically Prior-Data Fitted Networks (PFNs), have revolutionized outlier detection (OD) by enabling unsupervised zero-shot adaptation to new datasets without training. However, despite their predictive power, these models typically function as opaque black boxes, outputting scalar outlier scores that lack the operational context required for safety-critical decision-making. Existing post-hoc explanation methods are often computationally prohibitive for real-time deployment or fail to capture the epistemic uncertainty inherent in zero-shot inference. In this work, we introduce FoMo-X, a modular framework that equips OD foundation models with intrinsic, lightweight diagnostic capabilities. We leverage the insight that the frozen embeddings of a pretrained PFN backbone already encode rich, context-conditioned relational information. FoMo-X attaches auxiliary diagnostic heads to these embeddings, trained offline using the same generative simulator prior as the backbone. This allows us to distill computationally expensive properties, such as Monte Carlo dropout based epistemic uncertainty, into a deterministic, single-pass inference. We instantiate FoMo-X with two novel heads: a Severity Head that discretizes deviations into interpretable risk tiers, and an Uncertainty Head that provides calibrated confidence measures. Extensive evaluation on synthetic and real-world benchmarks (ADBench) demonstrates that FoMo-X recovers ground-truth diagnostic signals with high fidelity and negligible inference overhead. By bridging the gap between foundation model performance and operational explainability, FoMo-X offers a scalable path toward trustworthy, zero-shot outlier detection.
LGMar 18
Unsupervised Symbolic Anomaly DetectionMd Maruf Hossain, Tim Katzke, Simon Klüttermann et al.
We propose SYRAN, an unsupervised anomaly detection method based on symbolic regression. Instead of encoding normal patterns in an opaque, high-dimensional model, our method learns an ensemble of human-readable equations that describe symbolic invariants: functions that are approximately constant on normal data. Deviations from these invariants yield anomaly scores, so that the detection logic is interpretable by construction, rather than via post-hoc explanation. Experimental results demonstrate that SYRAN is highly interpretable, providing equations that correspond to known scientific or medical relationships, and maintains strong anomaly detection performance comparable to that of state-of-the-art methods.
LGFeb 21, 2025
A Cautionary Tale About "Neutrally" Informative AI Tools Ahead of the 2025 Federal Elections in GermanyIna Dormuth, Sven Franke, Marlies Hafer et al.
In this study, we examine the reliability of AI-based Voting Advice Applications (VAAs) and large language models (LLMs) in providing objective political information. Our analysis is based upon a comparison with party responses to 38 statements of the Wahl-O-Mat, a well-established German online tool that helps inform voters by comparing their views with political party positions. For the LLMs, we identify significant biases. They exhibit a strong alignment (over 75% on average) with left-wing parties and a substantially lower alignment with center-right (smaller 50%) and right-wing parties (around 30%). Furthermore, for the VAAs, intended to objectively inform voters, we found substantial deviations from the parties' stated positions in Wahl-O-Mat: While one VAA deviated in 25% of cases, another VAA showed deviations in more than 50% of cases. For the latter, we even observed that simple prompt injections led to severe hallucinations, including false claims such as non-existent connections between political parties and right-wing extremist ties.
LGApr 29, 2025
Unsupervised Surrogate Anomaly DetectionSimon Klüttermann, Tim Katzke, Emmanuel Müller
In this paper, we study unsupervised anomaly detection algorithms that learn a neural network representation, i.e. regular patterns of normal data, which anomalies are deviating from. Inspired by a similar concept in engineering, we refer to our methodology as surrogate anomaly detection. We formalize the concept of surrogate anomaly detection into a set of axioms required for optimal surrogate models and propose a new algorithm, named DEAN (Deep Ensemble ANomaly detection), designed to fulfill these criteria. We evaluate DEAN on 121 benchmark datasets, demonstrating its competitive performance against 19 existing methods, as well as the scalability and reliability of our method.
LGOct 10, 2025
On Uniformly Scaling Flows: A Density-Aligned Approach to Deep One-Class ClassificationFaried Abu Zaid, Tim Katzke, Emmanuel Müller et al.
Unsupervised anomaly detection is often framed around two widely studied paradigms. Deep one-class classification, exemplified by Deep SVDD, learns compact latent representations of normality, while density estimators realized by normalizing flows directly model the likelihood of nominal data. In this work, we show that uniformly scaling flows (USFs), normalizing flows with a constant Jacobian determinant, precisely connect these approaches. Specifically, we prove how training a USF via maximum-likelihood reduces to a Deep SVDD objective with a unique regularization that inherently prevents representational collapse. This theoretical bridge implies that USFs inherit both the density faithfulness of flows and the distance-based reasoning of one-class methods. We further demonstrate that USFs induce a tighter alignment between negative log-likelihood and latent norm than either Deep SVDD or non-USFs, and how recent hybrid approaches combining one-class objectives with VAEs can be naturally extended to USFs. Consequently, we advocate using USFs as a drop-in replacement for non-USFs in modern anomaly detection architectures. Empirically, this substitution yields consistent performance gains and substantially improved training stability across multiple benchmarks and model backbones for both image-level and pixel-level detection. These results unify two major anomaly detection paradigms, advancing both theoretical understanding and practical performance.
CVAug 15, 2025
From Pixels to Graphs: Deep Graph-Level Anomaly Detection on Dermoscopic ImagesDehn Xu, Tim Katzke, Emmanuel Müller
Graph Neural Networks (GNNs) have emerged as a powerful approach for graph-based machine learning tasks. Previous work applied GNNs to image-derived graph representations for various downstream tasks such as classification or anomaly detection. These transformations include segmenting images, extracting features from segments, mapping them to nodes, and connecting them. However, to the best of our knowledge, no study has rigorously compared the effectiveness of the numerous potential image-to-graph transformation approaches for GNN-based graph-level anomaly detection (GLAD). In this study, we systematically evaluate the efficacy of multiple segmentation schemes, edge construction strategies, and node feature sets based on color, texture, and shape descriptors to produce suitable image-derived graph representations to perform graph-level anomaly detection. We conduct extensive experiments on dermoscopic images using state-of-the-art GLAD models, examining performance and efficiency in purely unsupervised, weakly supervised, and fully supervised regimes. Our findings reveal, for example, that color descriptors contribute the best standalone performance, while incorporating shape and texture features consistently enhances detection efficacy. In particular, our best unsupervised configuration using OCGTL achieves a competitive AUC-ROC score of up to 0.805 without relying on pretrained backbones like comparable image-based approaches. With the inclusion of sparse labels, the performance increases substantially to 0.872 and with full supervision to 0.914 AUC-ROC.