Mennatallah El‐Assady

h-index24

44papers

2,093citations

Novelty37%

AI Score42

Ranked #57,678 of 194,257 authors (top 30%)#333 in HC (top 13%)

44 Papers

25.2CLNov 23, 2022Code

Automatic Generation of Socratic Subquestions for Teaching Math Word Problems

Kumar Shridhar, Jakub Macina, Mennatallah El-Assady et al. · cmu, eth-zurich

Socratic questioning is an educational method that allows students to discover answers to complex problems by asking them a series of thoughtful questions. Generation of didactically sound questions is challenging, requiring understanding of the reasoning process involved in the problem. We hypothesize that such questioning strategy can not only enhance the human performance, but also assist the math word problem (MWP) solvers. In this work, we explore the ability of large language models (LMs) in generating sequential questions for guiding math word problem-solving. We propose various guided question generation schemes based on input conditioning and reinforcement learning. On both automatic and human quality evaluations, we find that LMs constrained with desirable question properties generate superior questions and improve the overall performance of a math word problem solver. We conduct a preliminary user study to examine the potential value of such question generation models in the education domain. Results suggest that the difficulty level of problems plays an important role in determining whether questioning improves or hinders human performance. We discuss the future of using such questioning strategies in education.

20.5HCNov 28, 2023

RELIC: Investigating Large Language Model Responses using Self-Consistency

Furui Cheng, Vilém Zouhar, Simran Arora et al. · eth-zurich

Large Language Models (LLMs) are notorious for blending fact with fiction and generating non-factual content, known as hallucinations. To address this challenge, we propose an interactive system that helps users gain insight into the reliability of the generated text. Our approach is based on the idea that the self-consistency of multiple samples generated by the same LLM relates to its confidence in individual claims in the generated texts. Using this idea, we design RELIC, an interactive system that enables users to investigate and verify semantic-level variations in multiple long-form responses. This allows users to recognize potentially inaccurate information in the generated text and make necessary corrections. From a user study with ten participants, we demonstrate that our approach helps users better verify the reliability of the generated text. We further summarize the design implications and lessons learned from this research for future studies of reliable human-LLM interactions.

22.4CLOct 20, 2023Code

A Diachronic Perspective on User Trust in AI under Uncertainty

Shehzaad Dhuliawala, Vilém Zouhar, Mennatallah El-Assady et al. · eth-zurich

In a human-AI collaboration, users build a mental model of the AI system based on its reliability and how it presents its decision, e.g. its presentation of system confidence and an explanation of the output. Modern NLP systems are often uncalibrated, resulting in confidently incorrect predictions that undermine user trust. In order to build trustworthy AI, we must understand how user trust is developed and how it can be regained after potential trust-eroding events. We study the evolution of user trust in response to these trust-eroding events using a betting game. We find that even a few incorrect instances with inaccurate confidence estimates damage user trust and performance, with very slow recovery. We also show that this degradation in trust reduces the success of human-AI collaboration and that different types of miscalibration -- unconfidently correct and confidently incorrect -- have different negative effects on user trust. Our findings highlight the importance of calibration in user-facing AI applications and shed light on what aspects help users decide whether to trust the AI system.

8.1HCJul 14, 2023

Visual Explanations with Attributions and Counterfactuals on Time Series Classification

Udo Schlegel, Daniela Oelke, Daniel A. Keim et al.

With the rising necessity of explainable artificial intelligence (XAI), we see an increase in task-dependent XAI methods on varying abstraction levels. XAI techniques on a global level explain model behavior and on a local level explain sample predictions. We propose a visual analytics workflow to support seamless transitions between global and local explanations, focusing on attributions and counterfactuals on time series classification. In particular, we adapt local XAI techniques (attributions) that are developed for traditional datasets (images, text) to analyze time series classification, a data type that is typically less intelligible to humans. To generate a global overview, we apply local attribution methods to the data, creating explanations for the whole dataset. These explanations are projected onto two dimensions, depicting model behavior trends, strategies, and decision boundaries. To further inspect the model decision-making as well as potential data errors, a what-if analysis facilitates hypothesis generation and verification on both the global and local levels. We constantly collected and incorporated expert user feedback, as well as insights based on their domain knowledge, resulting in a tailored analysis workflow and system that tightly integrates time series transformations into explanations. Lastly, we present three use cases, verifying that our technique enables users to (1)~explore data transformations and feature relevance, (2)~identify model behavior and decision boundaries, as well as, (3)~the reason for misclassifications.

14.7CVSep 25, 2024Code

Navigating the Maze of Explainable AI: A Systematic Approach to Evaluating Methods and Metrics

Lukas Klein, Carsten T. Lüth, Udo Schlegel et al.

Explainable AI (XAI) is a rapidly growing domain with a myriad of proposed methods as well as metrics aiming to evaluate their efficacy. However, current studies are often of limited scope, examining only a handful of XAI methods and ignoring underlying design parameters for performance, such as the model architecture or the nature of input data. Moreover, they often rely on one or a few metrics and neglect thorough validation, increasing the risk of selection bias and ignoring discrepancies among metrics. These shortcomings leave practitioners confused about which method to choose for their problem. In response, we introduce LATEC, a large-scale benchmark that critically evaluates 17 prominent XAI methods using 20 distinct metrics. We systematically incorporate vital design parameters like varied architectures and diverse input modalities, resulting in 7,560 examined combinations. Through LATEC, we showcase the high risk of conflicting metrics leading to unreliable rankings and consequently propose a more robust evaluation scheme. Further, we comprehensively evaluate various XAI methods to assist practitioners in selecting appropriate methods aligning with their needs. Curiously, the emerging top-performing method, Expected Gradients, is not examined in any relevant related study. LATEC reinforces its role in future XAI research by publicly releasing all 326k saliency maps and 378k metric scores as a (meta-)evaluation dataset. The benchmark is hosted at: https://github.com/IML-DKFZ/latec.

3.9CVJun 15, 2023Code

Improving Explainability of Disentangled Representations using Multipath-Attribution Mappings

Lukas Klein, João B. S. Carvalho, Mennatallah El-Assady et al.

Explainable AI aims to render model behavior understandable by humans, which can be seen as an intermediate step in extracting causal relations from correlative patterns. Due to the high risk of possible fatal decisions in image-based clinical diagnostics, it is necessary to integrate explainable AI into these safety-critical systems. Current explanatory methods typically assign attribution scores to pixel regions in the input image, indicating their importance for a model's decision. However, they fall short when explaining why a visual feature is used. We propose a framework that utilizes interpretable disentangled representations for downstream-task prediction. Through visualizing the disentangled representations, we enable experts to investigate possible causation effects by leveraging their domain knowledge. Additionally, we deploy a multi-path attribution mapping for enriching and validating explanations. We demonstrate the effectiveness of our approach on a synthetic benchmark suite and two medical datasets. We show that the framework not only acts as a catalyst for causal relation extraction but also enhances model robustness by enabling shortcut detection without the need for testing under distribution shifts.

1.8LGOct 7, 2022

How to Enable Uncertainty Estimation in Proximal Policy Optimization

Eugene Bykovets, Yannick Metz, Mennatallah El-Assady et al.

While deep reinforcement learning (RL) agents have showcased strong results across many domains, a major concern is their inherent opaqueness and the safety of such systems in real-world use cases. To overcome these issues, we need agents that can quantify their uncertainty and detect out-of-distribution (OOD) states. Existing uncertainty estimation techniques, like Monte-Carlo Dropout or Deep Ensembles, have not seen widespread adoption in on-policy deep RL. We posit that this is due to two reasons: concepts like uncertainty and OOD states are not well defined compared to supervised learning, especially for on-policy RL methods. Secondly, available implementations and comparative studies for uncertainty estimation methods in RL have been limited. To overcome the first gap, we propose definitions of uncertainty and OOD for Actor-Critic RL algorithms, namely, proximal policy optimization (PPO), and present possible applicable measures. In particular, we discuss the concepts of value and policy uncertainty. The second point is addressed by implementing different uncertainty estimation methods and comparing them across a number of environments. The OOD detection performance is evaluated via a custom evaluation benchmark of in-distribution (ID) and OOD states for various RL environments. We identify a trade-off between reward and OOD detection performance. To overcome this, we formulate a Pareto optimization problem in which we simultaneously optimize for reward and OOD detection performance. We show experimentally that the recently proposed method of Masksembles strikes a favourable balance among the survey methods, enabling high-quality uncertainty estimation and OOD detection while matching the performance of original RL agents.

5.1HCSep 22, 2022

Characterizing Uncertainty in the Visual Text Analysis Pipeline

Pantea Haghighatkhah, Mennatallah El-Assady, Jean-Daniel Fekete et al.

Current visual text analysis approaches rely on sophisticated processing pipelines. Each step of such a pipeline potentially amplifies any uncertainties from the previous step. To ensure the comprehensibility and interoperability of the results, it is of paramount importance to clearly communicate the uncertainty not only of the output but also within the pipeline. In this paper, we characterize the sources of uncertainty along the visual text analysis pipeline. Within its three phases of labeling, modeling, and analysis, we identify six sources, discuss the type of uncertainty they create, and how they propagate.

17.3LGJun 27, 2022

Humans are not Boltzmann Distributions: Challenges and Opportunities for Modelling Human Feedback and Interaction in Reinforcement Learning

David Lindner, Mennatallah El-Assady

Reinforcement learning (RL) commonly assumes access to well-specified reward functions, which many practical applications do not provide. Instead, recently, more work has explored learning what to do from interacting with humans. So far, most of these approaches model humans as being (nosily) rational and, in particular, giving unbiased feedback. We argue that these models are too simplistic and that RL researchers need to develop more realistic human models to design and evaluate their algorithms. In particular, we argue that human models have to be personal, contextual, and dynamic. This paper calls for research from different disciplines to address key questions about how humans provide feedback to AIs and how we can build more robust human-in-the-loop RL systems.

18.4AIAug 17, 2022

Visual Comparison of Language Model Adaptation

Rita Sevastjanova, Eren Cakmak, Shauli Ravfogel et al.

Neural language models are widely used; however, their model parameters often need to be adapted to the specific domains and tasks of an application, which is time- and resource-consuming. Thus, adapters have recently been introduced as a lightweight alternative for model adaptation. They consist of a small set of task-specific parameters with a reduced training time and simple parameter composition. The simplicity of adapter training and composition comes along with new challenges, such as maintaining an overview of adapter properties and effectively comparing their produced embedding spaces. To help developers overcome these challenges, we provide a twofold contribution. First, in close collaboration with NLP researchers, we conducted a requirement analysis for an approach supporting adapter evaluation and detected, among others, the need for both intrinsic (i.e., embedding similarity-based) and extrinsic (i.e., prediction-based) explanation methods. Second, motivated by the gathered requirements, we designed a flexible visual analytics workspace that enables the comparison of adapter properties. In this paper, we discuss several design iterations and alternatives for interactive, comparative visual explanation methods. Our comparative visualizations show the differences in the adapted embedding vectors and prediction outcomes for diverse human-interpretable concepts (e.g., person names, human qualities). We evaluate our workspace through case studies and show that, for instance, an adapter trained on the language debiasing task according to context-0 (decontextualized) embeddings introduces a new type of bias where words (even gender-independent words such as countries) become more similar to female than male pronouns. We demonstrate that these are artifacts of context-0 embeddings.

15.5LGAug 8, 2023

RLHF-Blender: A Configurable Interactive Interface for Learning from Diverse Human Feedback

Yannick Metz, David Lindner, Raphaël Baur et al.

To use reinforcement learning from human feedback (RLHF) in practical applications, it is crucial to learn reward models from diverse sources of human feedback and to consider human factors involved in providing feedback of different types. However, the systematic study of learning from diverse types of feedback is held back by limited standardized tooling available to researchers. To bridge this gap, we propose RLHF-Blender, a configurable, interactive interface for learning from human feedback. RLHF-Blender provides a modular experimentation framework and implementation that enables researchers to systematically investigate the properties and qualities of human feedback for reward learning. The system facilitates the exploration of various feedback types, including demonstrations, rankings, comparisons, and natural language instructions, as well as studies considering the impact of human factors on their effectiveness. We discuss a set of concrete research opportunities enabled by RLHF-Blender. More information is available at https://rlhfblender.info/.

1.8LGAug 22, 2022

BARReL: Bottleneck Attention for Adversarial Robustness in Vision-Based Reinforcement Learning

Eugene Bykovets, Yannick Metz, Mennatallah El-Assady et al.

Robustness to adversarial perturbations has been explored in many areas of computer vision. This robustness is particularly relevant in vision-based reinforcement learning, as the actions of autonomous agents might be safety-critic or impactful in the real world. We investigate the susceptibility of vision-based reinforcement learning agents to gradient-based adversarial attacks and evaluate a potential defense. We observe that Bottleneck Attention Modules (BAM) included in CNN architectures can act as potential tools to increase robustness against adversarial attacks. We show how learned attention maps can be used to recover activations of a convolutional layer by restricting the spatial activations to salient regions. Across a number of RL environments, BAM-enhanced architectures show increased robustness during inference. Finally, we discuss potential future research directions.

2.3CLJul 14, 2022

Beware the Rationalization Trap! When Language Model Explainability Diverges from our Mental Models of Language

Rita Sevastjanova, Mennatallah El-Assady

Language models learn and represent language differently than humans; they learn the form and not the meaning. Thus, to assess the success of language model explainability, we need to consider the impact of its divergence from a user's mental model of language. In this position paper, we argue that in order to avoid harmful rationalization and achieve truthful understanding of language models, explanation processes must satisfy three main conditions: (1) explanations have to truthfully represent the model behavior, i.e., have a high fidelity; (2) explanations must be complete, as missing information distorts the truth; and (3) explanations have to take the user's mental model into account, progressively verifying a person's knowledge and adapting their understanding. We introduce a decision tree model to showcase potential reasons why current explanations fail to reach their objectives. We further emphasize the need for human-centered design to explain the model from multiple perspectives, progressively adapting explanations to changing user expectations.

11.7AISep 28, 2023

GInX-Eval: Towards In-Distribution Evaluation of Graph Neural Network Explanations

Kenza Amara, Mennatallah El-Assady, Rex Ying

Diverse explainability methods of graph neural networks (GNN) have recently been developed to highlight the edges and nodes in the graph that contribute the most to the model predictions. However, it is not clear yet how to evaluate the correctness of those explanations, whether it is from a human or a model perspective. One unaddressed bottleneck in the current evaluation procedure is the problem of out-of-distribution explanations, whose distribution differs from those of the training data. This important issue affects existing evaluation metrics such as the popular faithfulness or fidelity score. In this paper, we show the limitations of faithfulness metrics. We propose GInX-Eval (Graph In-distribution eXplanation Evaluation), an evaluation procedure of graph explanations that overcomes the pitfalls of faithfulness and offers new insights on explainability methods. Using a fine-tuning strategy, the GInX score measures how informative removed edges are for the model and the EdgeRank score evaluates if explanatory edges are correctly ordered by their importance. GInX-Eval verifies if ground-truth explanations are instructive to the GNN model. In addition, it shows that many popular methods, including gradient-based methods, produce explanations that are not better than a random designation of edges as important subgraphs, challenging the findings of current works in the area. Results with GInX-Eval are consistent across multiple datasets and align with human evaluation.

26.3CLJun 21, 2023

Which Spurious Correlations Impact Reasoning in NLI Models? A Visual Interactive Diagnosis through Data-Constrained Counterfactuals

Robin Chan, Afra Amini, Mennatallah El-Assady

We present a human-in-the-loop dashboard tailored to diagnosing potential spurious features that NLI models rely on for predictions. The dashboard enables users to generate diverse and challenging examples by drawing inspiration from GPT-3 suggestions. Additionally, users can receive feedback from a trained NLI model on how challenging the newly created example is and make refinements based on the feedback. Through our investigation, we discover several categories of spurious correlations that impact the reasoning of NLI models, which we group into three categories: Semantic Relevance, Logical Fallacies, and Bias. Based on our findings, we identify and describe various research opportunities, including diversifying training data and assessing NLI models' robustness by creating adversarial test suites.

1.4CVJul 11, 2022

From Correlation to Causation: Formalizing Interpretable Machine Learning as a Statistical Process

Lukas Klein, Mennatallah El-Assady, Paul F. Jäger

Explainable AI (XAI) is a necessity in safety-critical systems such as in clinical diagnostics due to a high risk for fatal decisions. Currently, however, XAI resembles a loose collection of methods rather than a well-defined process. In this work, we elaborate on conceptual similarities between the largest subgroup of XAI, interpretable machine learning (IML), and classical statistics. Based on these similarities, we present a formalization of IML along the lines of a statistical process. Adopting this statistical view allows us to interpret machine learning models and IML methods as sophisticated statistical tools. Based on this interpretation, we infer three key questions, which we identify as crucial for the success and adoption of IML in safety-critical settings. By formulating these questions, we further aim to spark a discussion about what distinguishes IML from classical statistics and what our perspective implies for the future of the field.

3.3CLOct 17, 2023

Revealing the Unwritten: Visual Investigation of Beam Search Trees to Address Language Model Prompting Challenges

Thilo Spinner, Rebecca Kehlbeck, Rita Sevastjanova et al.

The growing popularity of generative language models has amplified interest in interactive methods to guide model outputs. Prompt refinement is considered one of the most effective means to influence output among these methods. We identify several challenges associated with prompting large language models, categorized into data- and model-specific, linguistic, and socio-linguistic challenges. A comprehensive examination of model outputs, including runner-up candidates and their corresponding probabilities, is needed to address these issues. The beam search tree, the prevalent algorithm to sample model outputs, can inherently supply this information. Consequently, we introduce an interactive visual method for investigating the beam search tree, facilitating analysis of the decisions made by the model during generation. We quantitatively show the value of exposing the beam search tree and present five detailed analysis scenarios addressing the identified challenges. Our methodology validates existing results and offers additional insights.

7.2HCJul 8

Creativity from Friction: Human-AI Interaction for Exploratory Structural Design

Ricardo Maia Avelino, Rita Sevastjanova, Tom Van Mele et al.

AI agents that generate final answers based on user input often do not meet the needs of creative fields. Fields such as structural design and architecture need interactive systems that help users externalise and develop ideas, explore alternatives, and refine partial solutions. The final product of such designs needs to comply with many constraints concerning, e.g., spatial configuration, mechanical behaviour, material quantities, and costs. These constraints create friction in the design process, which can stimulate novel and creative solutions. In this paper, we discuss the misalignment between current generative AI goals to remove friction and provide final solutions and the needs of creators, such as structural designers, who develop ideas through iterative work. We present the design dimensions of systems allowing for constrained human-AI co-creation that rely on vision-language models making structural exploration conversational, multimodal, and responsive to evolving human intent in ways that follow and augment the discipline's creative process. Through a pilot design interface based on these principles and a study with experts in the field, this paper shows how structural designers perceive interactive AI systems and how such systems can support design space exploration by reducing repetitive modelling friction while preserving reflective design friction.

4.9HCJul 25, 2024

iNNspector: Visual, Interactive Deep Model Debugging

Thilo Spinner, Daniel Fürst, Mennatallah El-Assady

Deep learning model design, development, and debugging is a process driven by best practices, guidelines, trial-and-error, and the personal experiences of model developers. At multiple stages of this process, performance and internal model data can be logged and made available. However, due to the sheer complexity and scale of this data and process, model developers often resort to evaluating their model performance based on abstract metrics like accuracy and loss. We argue that a structured analysis of data along the model's architecture and at multiple abstraction levels can considerably streamline the debugging process. Such a systematic analysis can further connect the developer's design choices to their impacts on the model behavior, facilitating the understanding, diagnosis, and refinement of deep learning models. Hence, in this paper, we (1) contribute a conceptual framework structuring the data space of deep learning experiments. Our framework, grounded in literature analysis and requirements interviews, captures design dimensions and proposes mechanisms to make this data explorable and tractable. To operationalize our framework in a ready-to-use application, we (2) present the iNNspector system. iNNspector enables tracking of deep learning experiments and provides interactive visualizations of the data on all levels of abstraction from multiple models to individual neurons. Finally, we (3) evaluate our approach with three real-world use-cases and a user study with deep learning developers and data analysts, proving its effectiveness and usability.

12.0HCApr 18, 2024

Deconstructing Human-AI Collaboration: Agency, Interaction, and Adaptation

Steffen Holter, Mennatallah El-Assady

As full AI-based automation remains out of reach in most real-world applications, the focus has instead shifted to leveraging the strengths of both human and AI agents, creating effective collaborative systems. The rapid advances in this area have yielded increasingly more complex systems and frameworks, while the nuance of their characterization has gotten more vague. Similarly, the existing conceptual models no longer capture the elaborate processes of these systems nor describe the entire scope of their collaboration paradigms. In this paper, we propose a new unified set of dimensions through which to analyze and describe human-AI systems. Our conceptual model is centered around three high-level aspects - agency, interaction, and adaptation - and is developed through a multi-step process. Firstly, an initial design space is proposed by surveying the literature and consolidating existing definitions and conceptual frameworks. Secondly, this model is iteratively refined and validated by conducting semi-structured interviews with nine researchers in this field. Lastly, to illustrate the applicability of our design space, we utilize it to provide a structured description of selected human-AI systems.

6.1CLApr 23, 2024

Understanding Large Language Model Behaviors through Interactive Counterfactual Generation and Analysis

Furui Cheng, Vilém Zouhar, Robin Shing Moon Chan et al. · eth-zurich

Understanding the behavior of large language models (LLMs) is crucial for ensuring their safe and reliable use. However, existing explainable AI (XAI) methods for LLMs primarily rely on word-level explanations, which are often computationally inefficient and misaligned with human reasoning processes. Moreover, these methods often treat explanation as a one-time output, overlooking its inherently interactive and iterative nature. In this paper, we present LLM Analyzer, an interactive visualization system that addresses these limitations by enabling intuitive and efficient exploration of LLM behaviors through counterfactual analysis. Our system features a novel algorithm that generates fluent and semantically meaningful counterfactuals via targeted removal and replacement operations at user-defined levels of granularity. These counterfactuals are used to compute feature attribution scores, which are then integrated with concrete examples in a table-based visualization, supporting dynamic analysis of model behavior. A user study with LLM practitioners and interviews with experts demonstrate the system's usability and effectiveness, emphasizing the importance of involving humans in the explanation process as active participants rather than passive recipients.

8.3HCMar 12, 2024

generAItor: Tree-in-the-Loop Text Generation for Language Model Explainability and Adaptation

Thilo Spinner, Rebecca Kehlbeck, Rita Sevastjanova et al.

Large language models (LLMs) are widely deployed in various downstream tasks, e.g., auto-completion, aided writing, or chat-based text generation. However, the considered output candidates of the underlying search algorithm are under-explored and under-explained. We tackle this shortcoming by proposing a tree-in-the-loop approach, where a visual representation of the beam search tree is the central component for analyzing, explaining, and adapting the generated outputs. To support these tasks, we present generAItor, a visual analytics technique, augmenting the central beam search tree with various task-specific widgets, providing targeted visualizations and interaction possibilities. Our approach allows interactions on multiple levels and offers an iterative pipeline that encompasses generating, exploring, and comparing output candidates, as well as fine-tuning the model based on adapted data. Our case study shows that our tool generates new insights in gender bias analysis beyond state-of-the-art template-based methods. Additionally, we demonstrate the applicability of our approach in a qualitative user study. Finally, we quantitatively evaluate the adaptability of the model to few samples, as occurring in text-generation use cases.

10.4LGNov 18, 2024

Mapping out the Space of Human Feedback for Reinforcement Learning: A Conceptual Framework

Yannick Metz, David Lindner, Raphaël Baur et al.

Reinforcement Learning from Human feedback (RLHF) has become a powerful tool to fine-tune or train agentic machine learning models. Similar to how humans interact in social contexts, we can use many types of feedback to communicate our preferences, intentions, and knowledge to an RL agent. However, applications of human feedback in RL are often limited in scope and disregard human factors. In this work, we bridge the gap between machine learning and human-computer interaction efforts by developing a shared understanding of human feedback in interactive learning scenarios. We first introduce a taxonomy of feedback types for reward-based learning from human feedback based on nine key dimensions. Our taxonomy allows for unifying human-centered, interface-centered, and model-centered aspects. In addition, we identify seven quality metrics of human feedback influencing both the human ability to express feedback and the agent's ability to learn from the feedback. Based on the feedback taxonomy and quality criteria, we derive requirements and design choices for systems learning from human feedback. We relate these requirements and design choices to existing work in interactive machine learning. In the process, we identify gaps in existing work and future research opportunities. We call for interdisciplinary collaboration to harness the full potential of reinforcement learning with data-driven co-adaptive modeling and varied interaction mechanics.

16.6CLFeb 14, 2024Code

SyntaxShap: Syntax-aware Explainability Method for Text Generation

Kenza Amara, Rita Sevastjanova, Mennatallah El-Assady

To harness the power of large language models in safety-critical domains, we need to ensure the explainability of their predictions. However, despite the significant attention to model interpretability, there remains an unexplored domain in explaining sequence-to-sequence tasks using methods tailored for textual data. This paper introduces SyntaxShap, a local, model-agnostic explainability method for text generation that takes into consideration the syntax in the text data. The presented work extends Shapley values to account for parsing-based syntactic dependencies. Taking a game theoric approach, SyntaxShap only considers coalitions constraint by the dependency tree. We adopt a model-based evaluation to compare SyntaxShap and its weighted form to state-of-the-art explainability methods adapted to text generation tasks, using diverse metrics including faithfulness, coherency, and semantic alignment of the explanations to the model. We show that our syntax-aware method produces explanations that help build more faithful and coherent explanations for predictions by autoregressive models. Confronted with the misalignment of human and AI model reasoning, this paper also highlights the need for cautious evaluation strategies in explainable AI.

13.0CLMay 12, 2025Code

Concept-Level Explainability for Auditing & Steering LLM Responses

Kenza Amara, Rita Sevastjanova, Mennatallah El-Assady

As large language models (LLMs) become widely deployed, concerns about their safety and alignment grow. An approach to steer LLM behavior, such as mitigating biases or defending against jailbreaks, is to identify which parts of a prompt influence specific aspects of the model's output. Token-level attribution methods offer a promising solution, but still struggle in text generation, explaining the presence of each token in the output separately, rather than the underlying semantics of the entire LLM response. We introduce ConceptX, a model-agnostic, concept-level explainability method that identifies the concepts, i.e., semantically rich tokens in the prompt, and assigns them importance based on the outputs' semantic similarity. Unlike current token-level methods, ConceptX also offers to preserve context integrity through in-place token replacements and supports flexible explanation goals, e.g., gender bias. ConceptX enables both auditing, by uncovering sources of bias, and steering, by modifying prompts to shift the sentiment or reduce the harmfulness of LLM responses, without requiring retraining. Across three LLMs, ConceptX outperforms token-level methods like TokenSHAP in both faithfulness and human alignment. Steering tasks boost sentiment shift by 0.252 versus 0.131 for random edits and lower attack success rates from 0.463 to 0.242, outperforming attribution and paraphrasing baselines. While prompt engineering and self-explaining methods sometimes yield safer responses, ConceptX offers a transparent and faithful alternative for improving LLM safety and alignment, demonstrating the practical value of attribution-based explainability in guiding LLM behavior.

4.3CGJul 24, 2025

Explainable Mapper: Charting LLM Embedding Spaces Using Perturbation-Based Explanation and Verification Agents

Xinyuan Yan, Rita Sevastjanova, Sinie van der Ben et al.

Large language models (LLMs) produce high-dimensional embeddings that capture rich semantic and syntactic relationships between words, sentences, and concepts. Investigating the topological structures of LLM embedding spaces via mapper graphs enables us to understand their underlying structures. Specifically, a mapper graph summarizes the topological structure of the embedding space, where each node represents a topological neighborhood (containing a cluster of embeddings), and an edge connects two nodes if their corresponding neighborhoods overlap. However, manually exploring these embedding spaces to uncover encoded linguistic properties requires considerable human effort. To address this challenge, we introduce a framework for semi-automatic annotation of these embedding properties. To organize the exploration process, we first define a taxonomy of explorable elements within a mapper graph such as nodes, edges, paths, components, and trajectories. The annotation of these elements is executed through two types of customizable LLM-based agents that employ perturbation techniques for scalable and automated analysis. These agents help to explore and explain the characteristics of mapper elements and verify the robustness of the generated explanations. We instantiate the framework within a visual analytics workspace and demonstrate its effectiveness through case studies. In particular, we replicate findings from prior research on BERT's embedding properties across various layers of its architecture and provide further observations into the linguistic properties of topological neighborhoods.

8.3CLJul 7, 2025

Co-DETECT: Collaborative Discovery of Edge Cases in Text Classification

Chenfei Xiong, Jingwei Ni, Yu Fan et al. · eth-zurich

We introduce Co-DETECT (Collaborative Discovery of Edge cases in TExt ClassificaTion), a novel mixed-initiative annotation framework that integrates human expertise with automatic annotation guided by large language models (LLMs). Co-DETECT starts with an initial, sketch-level codebook and dataset provided by a domain expert, then leverages the LLM to annotate the data and identify edge cases that are not well described by the initial codebook. Specifically, Co-DETECT flags challenging examples, induces high-level, generalizable descriptions of edge cases, and assists user in incorporating edge case handling rules to improve the codebook. This iterative process enables more effective handling of nuanced phenomena through compact, generalizable annotation rules. Extensive user study, qualitative and quantitative analyses prove the effectiveness of Co-DETECT.

9.5HCJul 24, 2025

DxHF: Providing High-Quality Human Feedback for LLM Alignment via Interactive Decomposition

Danqing Shi, Furui Cheng, Tino Weinkauf et al.

Human preferences are widely used to align large language models (LLMs) through methods such as reinforcement learning from human feedback (RLHF). However, the current user interfaces require annotators to compare text paragraphs, which is cognitively challenging when the texts are long or unfamiliar. This paper contributes by studying the decomposition principle as an approach to improving the quality of human feedback for LLM alignment. This approach breaks down the text into individual claims instead of directly comparing two long-form text responses. Based on the principle, we build a novel user interface DxHF. It enhances the comparison process by showing decomposed claims, visually encoding the relevance of claims to the conversation and linking similar claims. This allows users to skim through key information and identify differences for better and quicker judgment. Our technical evaluation shows evidence that decomposition generally improves feedback accuracy regarding the ground truth, particularly for users with uncertainty. A crowdsourcing study with 160 participants indicates that using DxHF improves feedback accuracy by an average of 5%, although it increases the average feedback time by 18 seconds. Notably, accuracy is significantly higher in situations where users have less certainty. The finding of the study highlights the potential of HCI as an effective method for improving human-AI alignment.

6.7CLApr 9, 2025

LayerFlow: Layer-wise Exploration of LLM Embeddings using Uncertainty-aware Interlinked Projections

Rita Sevastjanova, Robin Gerling, Thilo Spinner et al.

Large language models (LLMs) represent words through contextual word embeddings encoding different language properties like semantics and syntax. Understanding these properties is crucial, especially for researchers investigating language model capabilities, employing embeddings for tasks related to text similarity, or evaluating the reasons behind token importance as measured through attribution methods. Applications for embedding exploration frequently involve dimensionality reduction techniques, which reduce high-dimensional vectors to two dimensions used as coordinates in a scatterplot. This data transformation step introduces uncertainty that can be propagated to the visual representation and influence users' interpretation of the data. To communicate such uncertainties, we present LayerFlow - a visual analytics workspace that displays embeddings in an interlinked projection design and communicates the transformation, representation, and interpretation uncertainty. In particular, to hint at potential data distortions and uncertainties, the workspace includes several visual components, such as convex hulls showing 2D and HD clusters, data point pairwise distances, cluster summaries, and projection quality metrics. We show the usability of the presented workspace through replication and expert case studies that highlight the need to communicate uncertainty through multiple visual components and different data perspectives.

5.5CLJun 4, 2024

On Affine Homotopy between Language Encoders

Robin SM Chan, Reda Boumasmoud, Anej Svete et al.

Pre-trained language encoders -- functions that represent text as vectors -- are an integral component of many NLP tasks. We tackle a natural question in language encoder analysis: What does it mean for two encoders to be similar? We contend that a faithful measure of similarity needs to be \emph{intrinsic}, that is, task-independent, yet still be informative of \emph{extrinsic} similarity -- the performance on downstream tasks. It is common to consider two encoders similar if they are \emph{homotopic}, i.e., if they can be aligned through some transformation. In this spirit, we study the properties of \emph{affine} alignment of language encoders and its implications on extrinsic similarity. We find that while affine alignment is fundamentally an asymmetric notion of similarity, it is still informative of extrinsic similarity. We confirm this on datasets of natural language representations. Beyond providing useful bounds on extrinsic similarity, affine intrinsic similarity also allows us to begin uncovering the structure of the space of pre-trained encoders by defining an order over them.

3.7HCJun 28, 2021

Communication Analysis through Visual Analytics: Current Practices, Challenges, and New Frontiers

Maximilian T. Fischer, Frederik L. Dennig, Daniel Seebacher et al.

The automated analysis of digital human communication data often focuses on specific aspects such as content or network structure in isolation. This can provide limited perspectives while making cross-methodological analyses, occurring in domains like investigative journalism, difficult. Communication research in psychology and the digital humanities instead stresses the importance of a holistic approach to overcome these limiting factors. In this work, we conduct an extensive survey on the properties of over forty semi-automated communication analysis systems and investigate how they cover concepts described in theoretical communication research. From these investigations, we derive a design space and contribute a conceptual framework based on communication research, technical considerations, and the surveyed approaches. The framework describes the systems' properties, capabilities, and composition through a wide range of criteria organized in the dimensions (1) Data, (2) Processing and Models, (3) Visual Interface, and (4) Knowledge Generation. These criteria enable a formalization of digital communication analysis through visual analytics, which, we argue, is uniquely suited for this task by tackling automation complexity while leveraging domain knowledge. With our framework, we identify shortcomings and research challenges, such as group communication dynamics, trust and privacy considerations, and holistic approaches. Simultaneously, our framework supports the evaluation of systems and promotes the mutual exchange between researchers through a structured common language, laying the foundations for future research on communication analysis.

7.2LGDec 8, 2020

An Empirical Study of Explainable AI Techniques on Deep Learning Models For Time Series Tasks

Udo Schlegel, Daniela Oelke, Daniel A. Keim et al.

Decision explanations of machine learning black-box models are often generated by applying Explainable AI (XAI) techniques. However, many proposed XAI methods produce unverified outputs. Evaluation and verification are usually achieved with a visual interpretation by humans on individual images or text. In this preregistration, we propose an empirical study and benchmark framework to apply attribution methods for neural networks developed for images and text data on time series. We present a methodology to automatically evaluate and rank attribution techniques on time series using perturbation methods to identify reliable approaches.

18.2HCOct 22, 2020

A Comparative Analysis of Industry Human-AI Interaction Guidelines

Austin P. Wright, Zijie J. Wang, Haekyu Park et al.

With the recent release of AI interaction guidelines from Apple, Google, and Microsoft, there is clearly interest in understanding the best practices in human-AI interaction. However, industry standards are not determined by a single company, but rather by the synthesis of knowledge from the whole community. We have surveyed all of the design guidelines from each of these major companies and developed a single, unified structure of guidelines, giving developers a centralized reference. We have then used this framework to compare each of the surveyed companies to find differences in areas of emphasis. Finally, we encourage people to contribute additional guidelines from other companies, academia, or individuals, to provide an open and extensible reference of AI design guidelines at https://ai-open-guidelines.readthedocs.io/.

9.6HCOct 18, 2020

Studying Visualization Guidelines According to Grounded Theory

Alexandra Diehl, Matthias Kraus, Alfie Abdul-Rahman et al.

Visualization guidelines, if defined properly, are invaluable to both practical applications and the theoretical foundation of visualization. In this paper, we present a collection of research activities for studying visualization guidelines according to Grounded Theory (GT). We used the discourses at VisGuides, which is an online discussion forum for visualization guidelines, as the main data source for enabling data-driven research processes as advocated by the grounded theory methodology. We devised a categorization scheme focusing on observing how visualization guidelines were featured in different threads and posts at VisGuides, and coded all 248 posts between September 27, 2017 (when VisGuides was first launched) and March 13, 2019. To complement manual categorization and coding, we used text analysis and visualization to help reveal patterns that may have been missed by the manual effort and summary statistics. To facilitate theoretical sampling and negative case analysis, we made an in-depth analysis of the 148 posts (with both questions and replies) related to a student assignment of a visualization course. Inspired by two discussion threads at VisGuides, we conducted two controlled empirical studies to collect further data to validate specific visualization guidelines. Through these activities guided by grounded theory, we have obtained some new findings about visualization guidelines.

15.7HCSep 14, 2020

Should We Trust (X)AI? Design Dimensions for Structured Experimental Evaluations

Fabian Sperrle, Mennatallah El-Assady, Grace Guo et al.

This paper systematically derives design dimensions for the structured evaluation of explainable artificial intelligence (XAI) approaches. These dimensions enable a descriptive characterization, facilitating comparisons between different study designs. They further structure the design space of XAI, converging towards a precise terminology required for a rigorous study of XAI. Our literature review differentiates between comparative studies and application papers, revealing methodological differences between the fields of machine learning, human-computer interaction, and visual analytics. Generally, each of these disciplines targets specific parts of the XAI process. Bridging the resulting gaps enables a holistic evaluation of XAI in real-world scenarios, as proposed by our conceptual model characterizing bias sources and trust-building. Furthermore, we identify and discuss the potential for future work based on observed research gaps that should lead to better coverage of the proposed model.

3.3HCSep 4, 2020

Augmenting Sheet Music with Rhythmic Fingerprints

Daniel Fürst, Matthias Miller, Daniel Keim et al.

In this paper, we bridge the gap between visualization and musicology by focusing on rhythm analysis tasks, which are tedious due to the complex visual encoding of the well-established Common Music Notation (CMN). Instead of replacing the CMN, we augment sheet music with rhythmic fingerprints to mitigate the complexity originating from the simultaneous encoding of musical features. The proposed visual design exploits music theory concepts such as the rhythm tree to facilitate the understanding of rhythmic information. Juxtaposing sheet music and the rhythmic fingerprints maintains the connection to the familiar representation. To investigate the usefulness of the rhythmic fingerprint design for identifying and comparing rhythmic patterns, we conducted a controlled user study with four experts and four novices. The results show that the rhythmic fingerprints enable novice users to recognize rhythmic patterns that only experts can identify using non-augmented sheet music.

20.8LGSep 16, 2019

Towards a Rigorous Evaluation of XAI Methods on Time Series

Udo Schlegel, Hiba Arnout, Mennatallah El-Assady et al.

Explainable Artificial Intelligence (XAI) methods are typically deployed to explain and debug black-box machine learning models. However, most proposed XAI methods are black-boxes themselves and designed for images. Thus, they rely on visual interpretability to evaluate and prove explanations. In this work, we apply XAI methods previously used in the image and text-domain on time series. We present a methodology to test and evaluate various XAI methods on time series by introducing new verification techniques to incorporate the temporal dimension. We further conduct preliminary experiments to assess the quality of selected XAI method explanations with various verification methods on a range of datasets and inspecting quality metrics on it. We demonstrate that in our initial experiments, SHAP works robust for all models, but others like DeepLIFT, LRP, and Saliency Maps work better with specific architectures.

12.0HCAug 21, 2019

Framing Visual Musicology through Methodology Transfer

Matthias Miller, Hanna Schäfer, Matthias Kraus et al.

In this position paper, we frame the field of Visual Musicology by providing an overview of well-established musicological sub-domains and their corresponding analytic and visualization tasks. To foster collaborative, interdisciplinary research, we discuss relevant data and domain characteristics. We give a description of the problem space, as well as the design space of musicology and discuss how existing problem-design mappings or solutions from other fields can be transferred to musicology. We argue that, through methodology transfer, established methods can be exploited to solve current musicological problems and show exemplary mappings from analytics fields related to text, geospatial, time-series, and other high-dimensional data to musicology. Finally, we point out open challenges, discuss research gaps, and highlight future research opportunities.

9.3HCAug 7, 2019

Speculative Execution for Guided Visual Analytics

Fabian Sperrle, Jürgen Bernard, Michael Sedlmair et al.

We propose the concept of Speculative Execution for Visual Analytics and discuss its effectiveness for model exploration and optimization. Speculative Execution enables the automatic generation of alternative, competing model configurations that do not alter the current model state unless explicitly confirmed by the user. These alternatives are computed based on either user interactions or model quality measures and can be explored using delta-visualizations. By automatically proposing modeling alternatives, systems employing Speculative Execution can shorten the gap between users and models, reduce the confirmation bias and speed up optimization processes. In this paper, we have assembled five application scenarios showcasing the potential of Speculative Execution, as well as a potential for further research.

21.4HCAug 1, 2019

Semantic Concept Spaces: Guided Topic Model Refinement using Word-Embedding Projections

Mennatallah El-Assady, Rebecca Kehlbeck, Christopher Collins et al.

We present a framework that allows users to incorporate the semantics of their domain knowledge for topic model refinement while remaining model-agnostic. Our approach enables users to (1) understand the semantic space of the model, (2) identify regions of potential conflicts and problems, and (3) readjust the semantic relation of concepts based on their understanding, directly influencing the topic modeling. These tasks are supported by an interactive visual analytics workspace that uses word-embedding projections to define concept regions which can then be refined. The user-refined concepts are independent of a particular document collection and can be transferred to related corpora. All user interactions within the concept space directly affect the semantic relations of the underlying vector space model, which, in turn, change the topic modeling. In addition to direct manipulation, our system guides the users' decision-making process through recommended interactions that point out potential improvements. This targeted refinement aims at minimizing the feedback required for an efficient human-in-the-loop process. We confirm the improvements achieved through our approach in two user studies that show topic model quality improvements through our visual knowledge externalization and learning process.

9.3HCJul 31, 2019

Augmenting Music Sheets with Harmonic Fingerprints

Matthias Miller, Alexandra Bonnici, Mennatallah El-Assady

Conventional Music Notation (CMN) is the well-established foundation for the written communication of musical information, such as rhythm, harmony, or timbre. However, CMN suffers from the complexity of its visual encoding and the need for extensive training to acquire proficiency and legibility. While alternative notations using additional visual variables (such as color to improve pitch identification) have been proposed, the music community does not readily accept notation systems that vary widely from the CMN. Therefore, to support student musicians in understanding the harmonic relationship of notes, instead of replacing the CMN, we present a visualization technique that augments a digital music sheet with a harmonic fingerprint glyph. Our design exploits the circle of fifths - a fundamental concept in music theory, as a visual metaphor. By attaching these visual glyphs to each bar of a selected composition we provide additional information about the salient harmonic features available in a musical piece. We conducted a user study to analyze the performance of experts and non-experts in an identification and comparison task of recurring patterns. The evaluation shows that the harmonic fingerprint supports these tasks without the need for close-reading, as when compared to a not-annotated music sheet.

38.1HCJul 29, 2019

explAIner: A Visual Analytics Framework for Interactive and Explainable Machine Learning

Thilo Spinner, Udo Schlegel, Hanna Schäfer et al.

We propose a framework for interactive and explainable machine learning that enables users to (1) understand machine learning models; (2) diagnose model limitations using different explainable AI methods; as well as (3) refine and optimize the models. Our framework combines an iterative XAI pipeline with eight global monitoring and steering mechanisms, including quality monitoring, provenance tracking, model comparison, and trust building. To operationalize the framework, we present explAIner, a visual analytics system for interactive and explainable machine learning that instantiates all phases of the suggested pipeline within the commonly used TensorBoard environment. We performed a user-study with nine participants across different expertise levels to examine their perception of our workflow and to collect suggestions to fill the gap between our system and framework. The evaluation confirms that our tightly integrated system leads to an informed machine learning process while disclosing opportunities for further extensions.

0.7CLJul 29, 2019

VIANA: Visual Interactive Annotation of Argumentation

Fabian Sperrle, Rita Sevastjanova, Rebecca Kehlbeck et al.

Argumentation Mining addresses the challenging tasks of identifying boundaries of argumentative text fragments and extracting their relationships. Fully automated solutions do not reach satisfactory accuracy due to their insufficient incorporation of semantics and domain knowledge. Therefore, experts currently rely on time-consuming manual annotations. In this paper, we present a visual analytics system that augments the manual annotation process by automatically suggesting which text fragments to annotate next. The accuracy of those suggestions is improved over time by incorporating linguistic knowledge and language modeling to learn a measure of argument similarity from user interactions. Based on a long-term collaboration with domain experts, we identify and model five high-level analysis tasks. We enable close reading and note-taking, annotation of arguments, argument reconstruction, extraction of argument relations, and exploration of argument graphs. To avoid context switches, we transition between all views through seamless morphing, visually anchoring all text- and graph-based layers. We evaluate our system with a two-stage expert user study based on a corpus of presidential debates. The results show that experts prefer our system over existing solutions due to the speedup provided by the automatic suggestions and the tight integration between text and graph views.

5.6IROct 25, 2018

Analyzing Visual Mappings of Traditional and Alternative Music Notation

Matthias Miller, Johannes Häußler, Matthias Kraus et al.

In this paper, we postulate that combining the domains of information visualization and music studies paves the ground for a more structured analysis of the design space of music notation, enabling the creation of alternative music notations that are tailored to different users and their tasks. Hence, we discuss the instantiation of a design and visualization pipeline for music notation that follows a structured approach, based on the fundamental concepts of information and data visualization. This enables practitioners and researchers of digital humanities and information visualization, alike, to conceptualize, create, and analyze novel music notation methods. Based on the analysis of relevant stakeholders and their usage of music notation as a mean of communication, we identify a set of relevant features typically encoded in different annotations and encodings, as used by interpreters, performers, and readers of music. We analyze the visual mappings of musical dimensions for varying notation methods to highlight gaps and frequent usages of encodings, visual channels, and Gestalt laws. This detailed analysis leads us to the conclusion that such an under-researched area in information visualization holds the potential for fundamental research. This paper discusses possible research opportunities, open challenges, and arguments that can be pursued in the process of analyzing, improving, or rethinking existing music notation systems and techniques.