LGJun 9, 2022
DORA: Exploring Outlier Representations in Deep Neural NetworksKirill Bykov, Mayukh Deb, Dennis Grinwald et al.
Deep Neural Networks (DNNs) excel at learning complex abstractions within their internal representations. However, the concepts they learn remain opaque, a problem that becomes particularly acute when models unintentionally learn spurious correlations. In this work, we present DORA (Data-agnOstic Representation Analysis), the first data-agnostic framework for analyzing the representational space of DNNs. Central to our framework is the proposed Extreme-Activation (EA) distance measure, which assesses similarities between representations by analyzing their activation patterns on data points that cause the highest level of activation. As spurious correlations often manifest in features of data that are anomalous to the desired task, such as watermarks or artifacts, we demonstrate that internal representations capable of detecting such artifactual concepts can be found by analyzing relationships within neural representations. We validate the EA metric quantitatively, demonstrating its effectiveness both in controlled scenarios and real-world applications. Finally, we provide practical examples from popular Computer Vision models to illustrate that representations identified as outliers using the EA metric often correspond to undesired and spurious concepts.
LGMar 9, 2023
Mark My Words: Dangers of Watermarked Images in ImageNetKirill Bykov, Klaus-Robert Müller, Marina M. -C. Höhne
The utilization of pre-trained networks, especially those trained on ImageNet, has become a common practice in Computer Vision. However, prior research has indicated that a significant number of images in the ImageNet dataset contain watermarks, making pre-trained networks susceptible to learning artifacts such as watermark patterns within their latent spaces. In this paper, we aim to assess the extent to which popular pre-trained architectures display such behavior and to determine which classes are most affected. Additionally, we examine the impact of watermarks on the extracted features. Contrary to the popular belief that the Chinese logographic watermarks impact the "carton" class only, our analysis reveals that a variety of ImageNet classes, such as "monitor", "broom", "apron" and "safe" rely on spurious correlations. Finally, we propose a simple approach to mitigate this issue in fine-tuned networks by ignoring the encodings from the feature-extractor layer of ImageNet pre-trained networks that are most susceptible to watermark imprints.
LGNov 22, 2023
Labeling Neural Representations with Inverse RecognitionKirill Bykov, Laura Kopf, Shinichi Nakajima et al.
Deep Neural Networks (DNNs) demonstrate remarkable capabilities in learning complex hierarchical data representations, but the nature of these representations remains largely unknown. Existing global explainability methods, such as Network Dissection, face limitations such as reliance on segmentation masks, lack of statistical significance testing, and high computational demands. We propose Inverse Recognition (INVERT), a scalable approach for connecting learned representations with human-understandable concepts by leveraging their capacity to discriminate between these concepts. In contrast to prior work, INVERT is capable of handling diverse types of neurons, exhibits less computational complexity, and does not rely on the availability of segmentation masks. Moreover, INVERT provides an interpretable metric assessing the alignment between the representation and its corresponding explanation and delivering a measure of statistical significance. We demonstrate the applicability of INVERT in various scenarios, including the identification of representations affected by spurious correlations, and the interpretation of the hierarchical structure of decision-making within the models.
AIMay 9
Do LLMs Experience an Internal Polylogue? Investigating Reasoning through the Lens of PersonasNils A. Herrmann, Leander Girrbach, Kirill Bykov et al.
Recent work shows that large language models (LLMs) encode behavioural traits ("personas") as linear directions in activation space, often called "persona vectors". Prior work has used such directions as static handles for behavioural steering. Building on this, we treat them as dynamic signals instead: probes we can monitor and intervene on as reasoning unfolds. We use the term polylogue to denote the time series of alignments between persona vectors and hidden activations over the course of generation. Experiments across four open-weight models show that polylogue features predict correctness on MMLU-Pro competitively with low-dimensional activation baselines, while remaining interpretable through their associated persona directions. They also suggest concrete steering targets, namely which latent directions to modulate at different stages of a response. We instantiate this as a simple paragraph-conditioned intervention that improves accuracy on three of four models, pointing to stage-aware latent steering as a promising direction for reasoning-time control. Together, this positions the polylogue as an interpretable tool for reasoning-time monitoring and intervention.
LGJan 11, 2024
Manipulating Feature Visualizations with Gradient SlingshotsDilyara Bareeva, Marina M. -C. Höhne, Alexander Warnecke et al.
Feature Visualization (FV) is a widely used technique for interpreting the concepts learned by Deep Neural Networks (DNNs), which synthesizes input patterns that maximally activate a given feature. Despite its popularity, the trustworthiness of FV explanations has received limited attention. In this paper, we introduce a novel method, Gradient Slingshots, that enables manipulation of FV without modifying the model architecture or significantly degrading its performance. By shaping new trajectories in the off-distribution regions of the activation landscape of a feature, we coerce the optimization process to converge in a predefined visualization. We evaluate our approach on several DNN architectures, demonstrating its ability to replace faithfuls FV with arbitrary targets. These results expose a critical vulnerability: auditors relying solely on FV may accept entirely fabricated explanations. To mitigate this risk, we propose a straightforward defense and quantitatively demonstrate its effectiveness.
LGJun 18, 2025
Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description FrameworkLaura Kopf, Nils Feldhus, Kirill Bykov et al.
Automated interpretability research aims to identify concepts encoded in neural network features to enhance human understanding of model behavior. Within the context of large language models (LLMs) for natural language processing (NLP), current automated neuron-level feature description methods face two key challenges: limited robustness and the assumption that each neuron encodes a single concept (monosemanticity), despite increasing evidence of polysemanticity. This assumption restricts the expressiveness of feature descriptions and limits their ability to capture the full range of behaviors encoded in model internals. To address this, we introduce Polysemantic FeatuRe Identification and Scoring Method (PRISM), a novel framework specifically designed to capture the complexity of features in LLMs. Unlike approaches that assign a single description per neuron, common in many automated interpretability methods in NLP, PRISM produces more nuanced descriptions that account for both monosemantic and polysemantic behavior. We apply PRISM to LLMs and, through extensive benchmarking against existing methods, demonstrate that our approach produces more accurate and faithful feature descriptions, improving both overall description quality (via a description score) and the ability to capture distinct concepts when polysemanticity is present (via a polysemanticity score).
LGApr 10, 2025
Deep Learning Meets Teleconnections: Improving S2S Predictions for European Winter WeatherPhiline L. Bommer, Marlene Kretschmer, Fiona R. Spuler et al.
Predictions on subseasonal-to-seasonal (S2S) timescales--ranging from two weeks to two month--are crucial for early warning systems but remain challenging owing to chaos in the climate system. Teleconnections, such as the stratospheric polar vortex (SPV) and Madden-Julian Oscillation (MJO), offer windows of enhanced predictability, however, their complex interactions remain underutilized in operational forecasting. Here, we developed and evaluated deep learning architectures to predict North Atlantic-European (NAE) weather regimes, systematically assessing the role of remote drivers in improving S2S forecast skill of deep learning models. We implemented (1) a Long Short-term Memory (LSTM) network predicting the NAE regimes of the next six weeks based on previous regimes, (2) an Index-LSTM incorporating SPV and MJO indices, and (3) a ViT-LSTM using a Vision Transformer to directly encode stratospheric wind and tropical outgoing longwave radiation fields. These models are compared with operational hindcasts as well as other AI models. Our results show that leveraging teleconnection information enhances skill at longer lead times. Notably, the ViT-LSTM outperforms ECMWF's subseasonal hindcasts beyond week 4 by improving Scandinavian Blocking (SB) and Atlantic Ridge (AR) predictions. Analysis of high-confidence predictions reveals that NAO-, SB, and AR opportunity forecasts can be associated with SPV variability and MJO phase patterns aligning with established pathways, also indicating new patterns. Overall, our work demonstrates that encoding physically meaningful climate fields can enhance S2S prediction skill, advancing AI-driven subseasonal forecast. Moreover, the experiments highlight the potential of deep learning methods as investigative tools, providing new insights into atmospheric dynamics and predictability.
LGJan 26, 2022
Visualizing the Diversity of Representations Learned by Bayesian Neural NetworksDennis Grinwald, Kirill Bykov, Shinichi Nakajima et al.
Explainable Artificial Intelligence (XAI) aims to make learning machines less opaque, and offers researchers and practitioners various tools to reveal the decision-making strategies of neural networks. In this work, we investigate how XAI methods can be used for exploring and visualizing the diversity of feature representations learned by Bayesian Neural Networks (BNNs). Our goal is to provide a global understanding of BNNs by making their decision-making strategies a) visible and tangible through feature visualizations and b) quantitatively measurable with a distance measure learned by contrastive learning. Our work provides new insights into the \emph{posterior} distribution in terms of human-understandable feature information with regard to the underlying decision making strategies. The main findings of our work are the following: 1) global XAI methods can be applied to explain the diversity of decision-making strategies of BNN instances, 2) Monte Carlo dropout with commonly used Dropout rates exhibit increased diversity in feature representations compared to the multimodal posterior approximation of MultiSWAG, 3) the diversity of learned feature representations highly correlates with the uncertainty estimate for the output and 4) the inter-mode diversity of the multimodal posterior decreases as the network width increases, while the intra mode diversity increases. These findings are consistent with the recent Deep Neural Networks theory, providing additional intuitions about what the theory implies in terms of humanly understandable concepts.
LGAug 23, 2021
Explaining Bayesian Neural NetworksKirill Bykov, Marina M. -C. Höhne, Adelaida Creosteanu et al.
To advance the transparency of learning machines such as Deep Neural Networks (DNNs), the field of Explainable AI (XAI) was established to provide interpretations of DNNs' predictions. While different explanation techniques exist, a popular approach is given in the form of attribution maps, which illustrate, given a particular data point, the relevant patterns the model has used for making its prediction. Although Bayesian models such as Bayesian Neural Networks (BNNs) have a limited form of transparency built-in through their prior weight distribution, they lack explanations of their predictions for given instances. In this work, we take a step toward combining these two perspectives by examining how local attributions can be extended to BNNs. Within the Bayesian framework, network weights follow a probability distribution; hence, the standard point explanation extends naturally to an explanation distribution. Viewing explanations probabilistically, we aggregate and analyze multiple local attributions drawn from an approximate posterior to explore variability in explanation patterns. The diversity of explanations offers a way to further explore how predictive rationales may vary across posterior samples. Quantitative and qualitative experiments on toy and benchmark data, as well as on a real-world pathology dataset, illustrate that our framework enriches standard explanations with uncertainty information and may support the visualization of explanation stability.
LGJun 18, 2021
NoiseGrad: Enhancing Explanations by Introducing Stochasticity to Model WeightsKirill Bykov, Anna Hedström, Shinichi Nakajima et al.
Many efforts have been made for revealing the decision-making process of black-box learning machines such as deep neural networks, resulting in useful local and global explanation methods. For local explanation, stochasticity is known to help: a simple method, called SmoothGrad, has improved the visual quality of gradient-based attribution by adding noise to the input space and averaging the explanations of the noisy inputs. In this paper, we extend this idea and propose NoiseGrad that enhances both local and global explanation methods. Specifically, NoiseGrad introduces stochasticity in the weight parameter space, such that the decision boundary is perturbed. NoiseGrad is expected to enhance the local explanation, similarly to SmoothGrad, due to the dual relationship between the input perturbation and the decision boundary perturbation. We evaluate NoiseGrad and its fusion with SmoothGrad -- FusionGrad -- qualitatively and quantitatively with several evaluation criteria, and show that our novel approach significantly outperforms the baseline methods. Both NoiseGrad and FusionGrad are method-agnostic and as handy as SmoothGrad using a simple heuristic for the choice of the hyperparameter setting without the need of finetuning.
LGJun 16, 2020
How Much Can I Trust You? -- Quantifying Uncertainties in Explaining Neural NetworksKirill Bykov, Marina M. -C. Höhne, Klaus-Robert Müller et al.
Explainable AI (XAI) aims to provide interpretations for predictions made by learning machines, such as deep neural networks, in order to make the machines more transparent for the user and furthermore trustworthy also for applications in e.g. safety-critical areas. So far, however, no methods for quantifying uncertainties of explanations have been conceived, which is problematic in domains where a high confidence in explanations is a prerequisite. We therefore contribute by proposing a new framework that allows to convert any arbitrary explanation method for neural networks into an explanation method for Bayesian neural networks, with an in-built modeling of uncertainties. Within the Bayesian framework a network's weights follow a distribution that extends standard single explanation scores and heatmaps to distributions thereof, in this manner translating the intrinsic network model uncertainties into a quantification of explanation uncertainties. This allows us for the first time to carve out uncertainties associated with a model explanation and subsequently gauge the appropriate level of explanation confidence for a user (using percentiles). We demonstrate the effectiveness and usefulness of our approach extensively in various experiments, both qualitatively and quantitatively.