LGAug 15, 2024Code
Normalized AOPC: Fixing Misleading Faithfulness Metrics for Feature Attribution ExplainabilityJoakim Edin, Andreas Geert Motzfeldt, Casper L. Christensen et al.
Deep neural network predictions are notoriously difficult to interpret. Feature attribution methods aim to explain these predictions by identifying the contribution of each input feature. Faithfulness, often evaluated using the area over the perturbation curve (AOPC), reflects feature attributions' accuracy in describing the internal mechanisms of deep neural networks. However, many studies rely on AOPC to compare faithfulness across different models, which we show can lead to false conclusions about models' faithfulness. Specifically, we find that AOPC is sensitive to variations in the model, resulting in unreliable cross-model comparisons. Moreover, AOPC scores are difficult to interpret in isolation without knowing the model-specific lower and upper limits. To address these issues, we propose a normalization approach, Normalized AOPC (NAOPC), enabling consistent cross-model evaluations and more meaningful interpretation of individual scores. Our experiments demonstrate that this normalization can radically change AOPC results, questioning the conclusions of earlier studies and offering a more robust framework for assessing feature attribution faithfulness. Our code is available at https://github.com/JoakimEdin/naopc.
CYJan 23Code
Measuring Individual User Fairness with User Similarity and Effectiveness DisparityTheresia Veronika Rampisela, Maria Maistro, Tuukka Ruotsalo et al.
Individual user fairness is commonly understood as treating similar users similarly. In Recommender Systems (RSs), several evaluation measures exist for quantifying individual user fairness. These measures evaluate fairness via either: (i) the disparity in RS effectiveness scores regardless of user similarity, or (ii) the disparity in items recommended to similar users regardless of item relevance. Both disparity in recommendation effectiveness and user similarity are very important in fairness, yet no existing individual user fairness measure simultaneously accounts for both. In brief, current user fairness evaluation measures implement a largely incomplete definition of fairness. To fill this gap, we present Pairwise User unFairness (PUF), a novel evaluation measure of individual user fairness that considers both effectiveness disparity and user similarity. PUF is the only measure that can express this important distinction. We empirically validate that PUF does this consistently across 4 datasets and 7 rankers, and robustly when varying user similarity or effectiveness. In contrast, all other measures are either almost insensitive to effectiveness disparity or completely insensitive to user similarity. We contribute the first RS evaluation measure to reliably capture both user similarity and effectiveness in individual user fairness. Our code: https://github.com/theresiavr/PUF-individual-user-fairness-recsys.
LGApr 21, 2023
Automated Medical Coding on MIMIC-III and MIMIC-IV: A Critical Review and Replicability StudyJoakim Edin, Alexander Junge, Jakob D. Havtorn et al.
Medical coding is the task of assigning medical codes to clinical free-text documentation. Healthcare professionals manually assign such codes to track patient diagnoses and treatments. Automated medical coding can considerably alleviate this administrative burden. In this paper, we reproduce, compare, and analyze state-of-the-art automated medical coding machine learning models. We show that several models underperform due to weak configurations, poorly sampled train-test splits, and insufficient evaluation. In previous work, the macro F1 score has been calculated sub-optimally, and our correction doubles it. We contribute a revised model comparison using stratified sampling and identical experimental setups, including hyperparameters and decision boundary tuning. We analyze prediction errors to validate and falsify assumptions of previous works. The analysis confirms that all models struggle with rare codes, while long documents only have a negligible impact. Finally, we present the first comprehensive results on the newly released MIMIC-IV dataset using the reproduced models. We release our code, model parameters, and new MIMIC-III and MIMIC-IV training and evaluation pipelines to accommodate fair future comparisons.
CYOct 29, 2025
The Quest for Reliable Metrics of Responsible AITheresia Veronika Rampisela, Maria Maistro, Tuukka Ruotsalo et al.
The development of Artificial Intelligence (AI), including AI in Science (AIS), should be done following the principles of responsible AI. Progress in responsible AI is often quantified through evaluation metrics, yet there has been less work on assessing the robustness and reliability of the metrics themselves. We reflect on prior work that examines the robustness of fairness metrics for recommender systems as a type of AI application and summarise their key takeaways into a set of non-exhaustive guidelines for developing reliable metrics of responsible AI. Our guidelines apply to a broad spectrum of AI applications, including AIS.
IRAug 29, 2025Code
Stairway to Fairness: Connecting Group and Individual FairnessTheresia Veronika Rampisela, Maria Maistro, Tuukka Ruotsalo et al.
Fairness in recommender systems (RSs) is commonly categorised into group fairness and individual fairness. However, there is no established scientific understanding of the relationship between the two fairness types, as prior work on both types has used different evaluation measures or evaluation objectives for each fairness type, thereby not allowing for a proper comparison of the two. As a result, it is currently not known how increasing one type of fairness may affect the other. To fill this gap, we study the relationship of group and individual fairness through a comprehensive comparison of evaluation measures that can be used for both fairness types. Our experiments with 8 runs across 3 datasets show that recommendations that are highly fair for groups can be very unfair for individuals. Our finding is novel and useful for RS practitioners aiming to improve the fairness of their systems. Our code is available at: https://github.com/theresiavr/stairway-to-fairness.
CLMay 23, 2025Code
GIM: Improved Interpretability for Large Language ModelsJoakim Edin, Róbert Csordás, Tuukka Ruotsalo et al.
Ensuring faithful interpretability in large language models is imperative for trustworthy and reliable AI. A key obstacle is self-repair, a phenomenon where networks compensate for reduced signal in one component by amplifying others, masking the true importance of the ablated component. While prior work attributes self-repair to layer normalization and back-up components that compensate for ablated components, we identify a novel form occurring within the attention mechanism, where softmax redistribution conceals the influence of important attention scores. This leads traditional ablation and gradient-based methods to underestimate the significance of all components contributing to these attention scores. We introduce Gradient Interaction Modifications (GIM), a technique that accounts for self-repair during backpropagation. Extensive experiments across multiple large language models (Gemma 2B/9B, LLAMA 1B/3B/8B, Qwen 1.5B/3B) and diverse tasks demonstrate that GIM significantly improves faithfulness over existing circuit identification and feature attribution methods. Our work is a significant step toward better understanding the inner mechanisms of LLMs, which is crucial for improving them and ensuring their safety. Our code is available at https://github.com/JoakimEdin/gim.
CLMar 27, 2025Code
As easy as PIE: understanding when pruning causes language models to disagreePietro Tropeano, Maria Maistro, Tuukka Ruotsalo et al.
Language Model (LM) pruning compresses the model by removing weights, nodes, or other parts of its architecture. Typically, pruning focuses on the resulting efficiency gains at the cost of effectiveness. However, when looking at how individual data points are affected by pruning, it turns out that a particular subset of data points always bears most of the brunt (in terms of reduced accuracy) when pruning, but this effect goes unnoticed when reporting the mean accuracy of all data points. These data points are called PIEs and have been studied in image processing, but not in NLP. In a study of various NLP datasets, pruning methods, and levels of compression, we find that PIEs impact inference quality considerably, regardless of class frequency, and that BERT is more prone to this than BiLSTM. We also find that PIEs contain a high amount of data points that have the largest influence on how well the model generalises to unseen data. This means that when pruning, with seemingly moderate loss to accuracy across all data points, we in fact hurt tremendously those data points that matter the most. We trace what makes PIEs both hard and impactful to inference to their overall longer and more semantically complex text. These findings are novel and contribute to understanding how LMs are affected by pruning. The code is available at: https://github.com/pietrotrope/AsEasyAsPIE
HCMay 15, 2024
Modeling User Preferences via Brain-Computer InterfacingLuis A. Leiva, V. Javier Traver, Alexandra Kawala-Sterniuk et al.
Present Brain-Computer Interfacing (BCI) technology allows inference and detection of cognitive and affective states, but fairly little has been done to study scenarios in which such information can facilitate new applications that rely on modeling human cognition. One state that can be quantified from various physiological signals is attention. Estimates of human attention can be used to reveal preferences and novel dimensions of user experience. Previous approaches have tackled these incredibly challenging tasks using a variety of behavioral signals, from dwell-time to click-through data, and computational models of visual correspondence to these behavioral signals. However, behavioral signals are only rough estimations of the real underlying attention and affective preferences of the users. Indeed, users may attend to some content simply because it is salient, but not because it is really interesting, or simply because it is outrageous. With this paper, we put forward a research agenda and example work using BCI to infer users' preferences, their attentional correlates towards visual content, and their associations with affective experience. Subsequently, we link these to relevant applications, such as information retrieval, personalized steering of generative models, and crowdsourcing population estimates of affective experiences.
CLFeb 24, 2024
Query Augmentation by Decoding Semantics from Brain SignalsZiyi Ye, Jingtao Zhan, Qingyao Ai et al. · tsinghua
Query augmentation is a crucial technique for refining semantically imprecise queries. Traditionally, query augmentation relies on extracting information from initially retrieved, potentially relevant documents. If the quality of the initially retrieved documents is low, then the effectiveness of query augmentation would be limited as well. We propose Brain-Aug, which enhances a query by incorporating semantic information decoded from brain signals. BrainAug generates the continuation of the original query with a prompt constructed with brain signal information and a ranking-oriented inference approach. Experimental results on fMRI (functional magnetic resonance imaging) datasets show that Brain-Aug produces semantically more accurate queries, leading to improved document ranking performance. Such improvement brought by brain signals is particularly notable for ambiguous queries.
CVJun 11, 2025
Self-Calibrating BCIs: Ranking and Recovery of Mental Targets Without LabelsJonathan Grizou, Carlos de la Torre-Ortiz, Tuukka Ruotsalo
We consider the problem of recovering a mental target (e.g., an image of a face) that a participant has in mind from paired EEG (i.e., brain responses) and image (i.e., perceived faces) data collected during interactive sessions without access to labeled information. The problem has been previously explored with labeled data but not via self-calibration, where labeled data is unavailable. Here, we present the first framework and an algorithm, CURSOR, that learns to recover unknown mental targets without access to labeled data or pre-trained decoders. Our experiments on naturalistic images of faces demonstrate that CURSOR can (1) predict image similarity scores that correlate with human perceptual judgments without any label information, (2) use these scores to rank stimuli against an unknown mental target, and (3) generate new stimuli indistinguishable from the unknown mental target (validated via a user study, N=53).
LGJun 13, 2024
An Unsupervised Approach to Achieve Supervised-Level Explainability in Healthcare RecordsJoakim Edin, Maria Maistro, Lars Maaløe et al.
Electronic healthcare records are vital for patient safety as they document conditions, plans, and procedures in both free text and medical codes. Language models have significantly enhanced the processing of such records, streamlining workflows and reducing manual data entry, thereby saving healthcare providers significant resources. However, the black-box nature of these models often leaves healthcare professionals hesitant to trust them. State-of-the-art explainability methods increase model transparency but rely on human-annotated evidence spans, which are costly. In this study, we propose an approach to produce plausible and faithful explanations without needing such annotations. We demonstrate on the automated medical coding task that adversarial robustness training improves explanation plausibility and introduce AttInGrad, a new explanation method superior to previous ones. By combining both contributions in a fully unsupervised setup, we produce explanations of comparable quality, or better, to that of a supervised approach. We release our code and model weights.
CLNov 16, 2023
BrainLLM: Generative Language Decoding from Brain RecordingsZiyi Ye, Qingyao Ai, Yiqun Liu et al.
Generating human language through non-invasive brain-computer interfaces (BCIs) has the potential to unlock many applications, such as serving disabled patients and improving communication. Currently, however, generating language via BCIs has been previously successful only within a classification setup for selecting pre-generated sentence continuation candidates with the most likely cortical semantic representation. Inspired by recent research that revealed associations between the brain and the large computational language models, we propose a generative language BCI that utilizes the capacity of a large language model (LLM) jointly with a semantic brain decoder to directly generate language from functional magnetic resonance imaging (fMRI) input. The proposed model can generate coherent language sequences aligned with the semantic content of visual or auditory language stimuli perceived, without prior knowledge of any pre-generated candidates. We compare the language generated from the presented model with a random control, pre-generated language selection approach, and a standard LLM, which generates common coherent text solely based on the next word likelihood according to statistical language training data. The proposed model is found to generate language that is more aligned with semantic stimulus in response to which brain input is sampled. Our findings demonstrate the potential and feasibility of employing BCIs in direct language generation.
HCAug 2, 2021
Interactive Visual Facets to Support Fluid Exploratory SearchChen He, Luana Micallef, Barış Serim et al.
Exploratory search starts with ill-defined goals and involves browsing, learning, and formulating new targets for search. To fluidly support such dynamic search behaviours, we focus on devising interactive visual facets (IVF), visualising information facets to support user comprehension and control of the information space. To do this, we reviewed existing faceted search interfaces and derived two design requirements (DR) that have not been fully addressed to support fluid interactions in exploratory search. We then exemplified the requirements through devising an IVF tool, which coordinates a linear and a categorical facet representing the distribution and summarisation of items, respectively, and providing context for faceted exploration (DR1). To support rapid transitions between search criteria (DR2), the tool introduces a novel design concept of using facets to select items without filtering the item space. Particularly, we propose a filter-swipe technique that enables users to drag a categorical facet value sequentially over linear facet bars to view the items in the intersection of the two facets along with the categorical facet dynamically summarizing the items in the interaction. Three applications demonstrate how the features support information discovery with ease. A user study of 11 participants with realistic email search tasks shows that dynamic suggestions through the timeline navigation can help discover useful suggestions for search; the novel design concept was favoured over using facet values as filters. Based on these practices, we derive IVF design implications for fluid, exploratory searches.
IRJul 12, 2016
Natural brain-information interfaces: Recommending information by relevance inferred from human brain signalsManuel J. A. Eugster, Tuukka Ruotsalo, Michiel M. Spapé et al.
Finding relevant information from large document collections such as the World Wide Web is a common task in our daily lives. Estimation of a user's interest or search intention is necessary to recommend and retrieve relevant information from these collections. We introduce a brain-information interface used for recommending information by relevance inferred directly from brain signals. In experiments, participants were asked to read Wikipedia documents about a selection of topics while their EEG was recorded. Based on the prediction of word relevance, the individual's search intent was modeled and successfully used for retrieving new, relevant documents from the whole English Wikipedia corpus. The results show that the users' interests towards digital content can be modeled from the brain signals evoked by reading. The introduced brain-relevance paradigm enables the recommendation of information without any explicit user interaction, and may be applied across diverse information-intensive applications.