IVAug 23, 2022Code
Unsupervised Anomaly Localization with Structural Feature-AutoencodersFelix Meissen, Johannes Paetzold, Georgios Kaissis et al.
Unsupervised Anomaly Detection has become a popular method to detect pathologies in medical images as it does not require supervision or labels for training. Most commonly, the anomaly detection model generates a "normal" version of an input image, and the pixel-wise $l^p$-difference of the two is used to localize anomalies. However, large residuals often occur due to imperfect reconstruction of the complex anatomical structures present in most medical images. This method also fails to detect anomalies that are not characterized by large intensity differences to the surrounding tissue. We propose to tackle this problem using a feature-mapping function that transforms the input intensity images into a space with multiple channels where anomalies can be detected along different discriminative feature maps extracted from the original image. We then train an Autoencoder model in this space using structural similarity loss that does not only consider differences in intensity but also in contrast and structure. Our method significantly increases performance on two medical data sets for brain MRI. Code and experiments are available at https://github.com/FeliMe/feature-autoencoder
CVDec 16, 2022
Biomedical image analysis competitions: The state of current participation practiceMatthias Eisenmann, Annika Reinke, Vivienn Weru et al. · utoronto
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants and only 50% of the participants performed ensembling based on multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
CVMar 1, 2023Code
Unsupervised Pathology Detection: A Deep Dive Into the State of the ArtIoannis Lagogiannis, Felix Meissen, Georgios Kaissis et al.
Deep unsupervised approaches are gathering increased attention for applications such as pathology detection and segmentation in medical images since they promise to alleviate the need for large labeled datasets and are more generalizable than their supervised counterparts in detecting any kind of rare pathology. As the Unsupervised Anomaly Detection (UAD) literature continuously grows and new paradigms emerge, it is vital to continuously evaluate and benchmark new methods in a common framework, in order to reassess the state-of-the-art (SOTA) and identify promising research directions. To this end, we evaluate a diverse selection of cutting-edge UAD methods on multiple medical datasets, comparing them against the established SOTA in UAD for brain MRI. Our experiments demonstrate that newly developed feature-modeling methods from the industrial and medical literature achieve increased performance compared to previous work and set the new SOTA in a variety of modalities and datasets. Additionally, we show that such methods are capable of benefiting from recently developed self-supervised pre-training algorithms, further increasing their performance. Finally, we perform a series of experiments in order to gain further insights into some unique characteristics of selected models and datasets. Our code can be found under https://github.com/iolag/UPD_study/.
CVMar 3, 2023Code
Robust Detection Outcome: A Metric for Pathology Detection in Medical ImagesFelix Meissen, Philip Müller, Georgios Kaissis et al.
Detection of pathologies is a fundamental task in medical imaging and the evaluation of algorithms that can perform this task automatically is crucial. However, current object detection metrics for natural images do not reflect the specific clinical requirements in pathology detection sufficiently. To tackle this problem, we propose Robust Detection Outcome (RoDeO); a novel metric for evaluating algorithms for pathology detection in medical images, especially in chest X-rays. RoDeO evaluates different errors directly and individually, and reflects clinical needs better than current metrics. Extensive evaluation on the ChestX-ray8 dataset shows the superiority of our metrics compared to existing ones. We released the code at https://github.com/FeliMe/RoDeO and published RoDeO as pip package (rodeometric).
LGDec 31, 2022
Approaching Peak Ground TruthFlorian Kofler, Johannes Wahle, Ivan Ezhov et al.
Machine learning models are typically evaluated by computing similarity with reference annotations and trained by maximizing similarity with such. Especially in the biomedical domain, annotations are subjective and suffer from low inter- and intra-rater reliability. Since annotations only reflect one interpretation of the real world, this can lead to sub-optimal predictions even though the model achieves high similarity scores. Here, the theoretical concept of PGT is introduced. PGT marks the point beyond which an increase in similarity with the \emph{reference annotation} stops translating to better RWMP. Additionally, a quantitative technique to approximate PGT by computing inter- and intra-rater reliability is proposed. Finally, four categories of PGT-aware strategies to evaluate and improve model performance are reviewed.
CLNov 14, 2025Code
NOVA: An Agentic Framework for Automated Histopathology Analysis and DiscoveryAnurag J. Vaidya, Felix Meissen, Daniel C. Castro et al.
Digitized histopathology analysis involves complex, time-intensive workflows and specialized expertise, limiting its accessibility. We introduce NOVA, an agentic framework that translates scientific queries into executable analysis pipelines by iteratively generating and running Python code. NOVA integrates 49 domain-specific tools (e.g., nuclei segmentation, whole-slide encoding) built on open-source software, and can also create new tools ad hoc. To evaluate such systems, we present SlideQuest, a 90-question benchmark -- verified by pathologists and biomedical scientists -- spanning data processing, quantitative analysis, and hypothesis testing. Unlike prior biomedical benchmarks focused on knowledge recall or diagnostic QA, SlideQuest demands multi-step reasoning, iterative coding, and computational problem solving. Quantitative evaluation shows NOVA outperforms coding-agent baselines, and a pathologist-verified case study links morphology to prognostically relevant PAM50 subtypes, demonstrating its scalable discovery potential.
CVSep 5, 2023
Anatomy-Driven Pathology Detection on Chest X-raysPhilip Müller, Felix Meissen, Johannes Brandt et al.
Pathology detection and delineation enables the automatic interpretation of medical scans such as chest X-rays while providing a high level of explainability to support radiologists in making informed decisions. However, annotating pathology bounding boxes is a time-consuming task such that large public datasets for this purpose are scarce. Current approaches thus use weakly supervised object detection to learn the (rough) localization of pathologies from image-level annotations, which is however limited in performance due to the lack of bounding box supervision. We therefore propose anatomy-driven pathology detection (ADPD), which uses easy-to-annotate bounding boxes of anatomical regions as proxies for pathologies. We study two training approaches: supervised training using anatomy-level pathology labels and multiple instance learning (MIL) with image-level pathology labels. Our results show that our anatomy-level training approach outperforms weakly supervised methods and fully supervised detection with limited training samples, and our MIL approach is competitive with both baseline approaches, therefore demonstrating the potential of our approach.
LGSep 25, 2023
(Predictable) Performance Bias in Unsupervised Anomaly DetectionFelix Meissen, Svenja Breuer, Moritz Knolle et al.
Background: With the ever-increasing amount of medical imaging data, the demand for algorithms to assist clinicians has amplified. Unsupervised anomaly detection (UAD) models promise to aid in the crucial first step of disease detection. While previous studies have thoroughly explored fairness in supervised models in healthcare, for UAD, this has so far been unexplored. Methods: In this study, we evaluated how dataset composition regarding subgroups manifests in disparate performance of UAD models along multiple protected variables on three large-scale publicly available chest X-ray datasets. Our experiments were validated using two state-of-the-art UAD models for medical images. Finally, we introduced a novel subgroup-AUROC (sAUROC) metric, which aids in quantifying fairness in machine learning. Findings: Our experiments revealed empirical "fairness laws" (similar to "scaling laws" for Transformers) for training-dataset composition: Linear relationships between anomaly detection performance within a subpopulation and its representation in the training data. Our study further revealed performance disparities, even in the case of balanced training data, and compound effects that exacerbate the drop in performance for subjects associated with multiple adversely affected groups. Interpretation: Our study quantified the disparate performance of UAD models against certain demographic subgroups. Importantly, we showed that this unfairness cannot be mitigated by balanced representation alone. Instead, the representation of some subgroups seems harder to learn by UAD models than that of others. The empirical fairness laws discovered in our study make disparate performance in UAD models easier to estimate and aid in determining the most desirable dataset composition.
CVFeb 19, 2024Code
Weakly Supervised Object Detection in Chest X-Rays with Differentiable ROI Proposal Networks and Soft ROI PoolingPhilip Müller, Felix Meissen, Georgios Kaissis et al.
Weakly supervised object detection (WSup-OD) increases the usefulness and interpretability of image classification algorithms without requiring additional supervision. The successes of multiple instance learning in this task for natural images, however, do not translate well to medical images due to the very different characteristics of their objects (i.e. pathologies). In this work, we propose Weakly Supervised ROI Proposal Networks (WSRPN), a new method for generating bounding box proposals on the fly using a specialized region of interest-attention (ROI-attention) module. WSRPN integrates well with classic backbone-head classification algorithms and is end-to-end trainable with only image-label supervision. We experimentally demonstrate that our new method outperforms existing methods in the challenging task of disease localization in chest X-ray images. Code: https://github.com/philip-mueller/wsrpn
LGJul 17, 2025Code
Insights into a radiology-specialised multimodal large language model with sparse autoencodersKenza Bouzid, Shruthi Bannur, Felix Meissen et al. · cambridge, microsoft-research
Interpretability can improve the safety, transparency and trust of AI models, which is especially important in healthcare applications where decisions often carry significant consequences. Mechanistic interpretability, particularly through the use of sparse autoencoders (SAEs), offers a promising approach for uncovering human-interpretable features within large transformer-based models. In this study, we apply Matryoshka-SAE to the radiology-specialised multimodal large language model, MAIRA-2, to interpret its internal representations. Using large-scale automated interpretability of the SAE features, we identify a range of clinically relevant concepts - including medical devices (e.g., line and tube placements, pacemaker presence), pathologies such as pleural effusion and cardiomegaly, longitudinal changes and textual features. We further examine the influence of these features on model behaviour through steering, demonstrating directional control over generations with mixed success. Our results reveal practical and methodological challenges, yet they offer initial insights into the internal concepts learned by MAIRA-2 - marking a step toward deeper mechanistic understanding and interpretability of a radiology-adapted multimodal large language model, and paving the way for improved model transparency. We release the trained SAEs and interpretations: https://huggingface.co/microsoft/maira-2-sae.
CLJun 24, 2024Code
Evaluation of Language Models in the Medical Context Under Resource-Constrained SettingsAndrea Posada, Daniel Rueckert, Felix Meissen et al.
Since the Transformer architecture emerged, language model development has grown, driven by their promising potential. Releasing these models into production requires properly understanding their behavior, particularly in sensitive domains like medicine. Despite this need, the medical literature still lacks practical assessment of pre-trained language models, which are especially valuable in settings where only consumer-grade computational resources are available. To address this gap, we have conducted a comprehensive survey of language models in the medical field and evaluated a subset of these for medical text classification and conditional text generation. The subset includes 53 models with 110 million to 13 billion parameters, spanning the Transformer-based model families and knowledge domains. Different approaches are employed for text classification, including zero-shot learning, enabling tuning without the need to train the model. These approaches are helpful in our target settings, where many users of language models find themselves. The results reveal remarkable performance across the tasks and datasets evaluated, underscoring the potential of certain models to contain medical knowledge, even without domain specialization. This study thus advocates for further exploration of model applications in medical contexts, particularly in computational resource-constrained settings, to benefit a wide range of users. The code is available on https://github.com/anpoc/Language-models-in-medicine.
CVDec 6, 2023Code
How Low Can You Go? Surfacing Prototypical In-Distribution Samples for Unsupervised Anomaly DetectionFelix Meissen, Johannes Getzner, Alexander Ziller et al.
Unsupervised anomaly detection (UAD) alleviates large labeling efforts by training exclusively on unlabeled in-distribution data and detecting outliers as anomalies. Generally, the assumption prevails that large training datasets allow the training of higher-performing UAD models. However, in this work, we show that UAD with extremely few training samples can already match -- and in some cases even surpass -- the performance of training with the whole training dataset. Building upon this finding, we propose an unsupervised method to reliably identify prototypical samples to further boost UAD performance. We demonstrate the utility of our method on seven different established UAD benchmarks from computer vision, industrial defect detection, and medicine. With just 25 selected samples, we even exceed the performance of full training in $25/67$ categories in these benchmarks. Additionally, we show that the prototypical in-distribution samples identified by our proposed method generalize well across models and datasets and that observing their sample selection criteria allows for a successful manual selection of small subsets of high-performing samples. Our code is available at https://anonymous.4open.science/r/uad_prototypical_samples/
IVFeb 8, 2022Code
On the Pitfalls of Using the Residual Error as Anomaly ScoreFelix Meissen, Benedikt Wiestler, Georgios Kaissis et al.
Many current state-of-the-art methods for anomaly localization in medical images rely on calculating a residual image between a potentially anomalous input image and its "healthy" reconstruction. As the reconstruction of the unseen anomalous region should be erroneous, this yields large residuals as a score to detect anomalies in medical images. However, this assumption does not take into account residuals resulting from imperfect reconstructions of the machine learning models used. Such errors can easily overshadow residuals of interest and therefore strongly question the use of residual images as scoring function. Our work explores this fundamental problem of residual images in detail. We theoretically define the problem and thoroughly evaluate the influence of intensity and texture of anomalies against the effect of imperfect reconstructions in a series of experiments. Code and experiments are available under https://github.com/FeliMe/residual-score-pitfalls
IVApr 23, 2024
Interactive Generation of Laparoscopic Videos with Diffusion ModelsIvan Iliash, Simeon Allmendinger, Felix Meissen et al.
Generative AI, in general, and synthetic visual data generation, in specific, hold much promise for benefiting surgical training by providing photorealism to simulation environments. Current training methods primarily rely on reading materials and observing live surgeries, which can be time-consuming and impractical. In this work, we take a significant step towards improving the training process. Specifically, we use diffusion models in combination with a zero-shot video diffusion method to interactively generate realistic laparoscopic images and videos by specifying a surgical action through text and guiding the generation with tool positions through segmentation masks. We demonstrate the performance of our approach using the publicly available Cholec dataset family and evaluate the fidelity and factual correctness of our generated images using a surgical action recognition model as well as the pixel-wise F1-score for the spatial control of tool generation. We achieve an FID of 38.097 and an F1-score of 0.71.
CLJun 6, 2024
MAIRA-2: Grounded Radiology Report GenerationShruthi Bannur, Kenza Bouzid, Daniel C. Castro et al.
Radiology reporting is a complex task requiring detailed medical image understanding and precise language generation, for which generative multimodal models offer a promising solution. However, to impact clinical practice, models must achieve a high level of both verifiable performance and utility. We augment the utility of automated report generation by incorporating localisation of individual findings on the image - a task we call grounded report generation - and enhance performance by incorporating realistic reporting context as inputs. We design a novel evaluation framework (RadFact) leveraging the logical inference capabilities of large language models (LLMs) to quantify report correctness and completeness at the level of individual sentences, while supporting the new task of grounded reporting. We develop MAIRA-2, a large radiology-specific multimodal model designed to generate chest X-ray reports with and without grounding. MAIRA-2 achieves state of the art on existing report generation benchmarks and establishes the novel task of grounded report generation.
IVMay 15, 2023
The Brain Tumor Segmentation (BraTS) Challenge 2023: Brain MR Image Synthesis for Tumor Segmentation (BraSyn)Hongwei Bran Li, Gian Marco Conte, Qingqiao Hu et al.
Automated brain tumor segmentation methods have become well-established and reached performance levels offering clear clinical utility. These methods typically rely on four input magnetic resonance imaging (MRI) modalities: T1-weighted images with and without contrast enhancement, T2-weighted images, and FLAIR images. However, some sequences are often missing in clinical practice due to time constraints or image artifacts, such as patient motion. Consequently, the ability to substitute missing modalities and gain segmentation performance is highly desirable and necessary for the broader adoption of these algorithms in the clinical routine. In this work, we present the establishment of the Brain MR Image Synthesis Benchmark (BraSyn) in conjunction with the Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2023. The primary objective of this challenge is to evaluate image synthesis methods that can realistically generate missing MRI modalities when multiple available images are provided. The ultimate aim is to facilitate automated brain tumor segmentation pipelines. The image dataset used in the benchmark is diverse and multi-modal, created through collaboration with various hospitals and research institutions.
IVMay 15, 2023
The Brain Tumor Segmentation (BraTS) Challenge: Local Synthesis of Healthy Brain Tissue via InpaintingFlorian Kofler, Felix Meissen, Felix Steinbauer et al.
A myriad of algorithms for the automatic analysis of brain MR images is available to support clinicians in their decision-making. For brain tumor patients, the image acquisition time series typically starts with an already pathological scan. This poses problems, as many algorithms are designed to analyze healthy brains and provide no guarantee for images featuring lesions. Examples include, but are not limited to, algorithms for brain anatomy parcellation, tissue segmentation, and brain extraction. To solve this dilemma, we introduce the BraTS inpainting challenge. Here, the participants explore inpainting techniques to synthesize healthy brain scans from lesioned ones. The following manuscript contains the task formulation, dataset, and submission procedure. Later, it will be updated to summarize the findings of the challenge. The challenge is organized as part of the ASNR-BraTS MICCAI challenge.
IVJan 24, 2022
AutoSeg -- Steering the Inductive Biases for Automatic Pathology SegmentationFelix Meissen, Georgios Kaissis, Daniel Rueckert
In medical imaging, un-, semi-, or self-supervised pathology detection is often approached with anomaly- or out-of-distribution detection methods, whose inductive biases are not intentionally directed towards detecting pathologies, and are therefore sub-optimal for this task. To tackle this problem, we propose AutoSeg, an engine that can generate diverse artificial anomalies that resemble the properties of real-world pathologies. Our method can accurately segment unseen artificial anomalies and outperforms existing methods for pathology detection on a challenging real-world dataset of Chest X-ray images. We experimentally evaluate our method on the Medical Out-of-Distribution Analysis Challenge 2021.