CV · Jul 9, 2022
Explaining Chest X-ray Pathologies in Natural Language
Maxime Kayser, Cornelius Emde, Oana-Maria Camburu et al.
Most deep learning algorithms lack explanations for their predictions, which limits their deployment in clinical practice. Approaches to improve explainability, especially in medical imaging, have often been shown to convey limited information, be overly reassuring, or lack robustness. In this work, we introduce the task of generating natural language explanations (NLEs) to justify predictions made on medical images. NLEs are human-friendly and comprehensive, and enable the training of intrinsically explainable models. To this end, we introduce MIMIC-NLE, the first large-scale medical imaging dataset with NLEs. It contains over 38,000 NLEs, which explain the presence of various thoracic pathologies and chest X-ray findings. We propose a general approach to solve the task and evaluate several architectures on this dataset, including via clinician assessment.
CV · Dec 21, 2025
brat: Aligned Multi-View Embeddings for Brain MRI Analysis
Maxime Kayser, Maksim Gridnev, Wanting Wang et al.
We present brat (brain report alignment transformer), a multi-view representation learning framework for brain magnetic resonance imaging (MRI) trained on MRIs paired with clinical reports. Brain MRIs present unique challenges due to the presence of numerous, highly varied, and often subtle abnormalities that are localized to a few slices within a 3D volume. To address these challenges, we introduce a brain MRI dataset $10\times$ larger than existing ones, containing approximately 80,000 3D scans with corresponding radiology reports, and propose a multi-view pre-training approach inspired by advances in document retrieval. We develop an implicit query-feature matching mechanism and adopt concepts from quality-diversity to obtain multi-view embeddings of MRIs that are aligned with the clinical features given by report sentences. We evaluate our approach across multiple vision-language and vision tasks, demonstrating substantial performance improvements. The brat foundation models are publicly released.
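The abstract says the pre-training approach is inspired by document retrieval and aligns several embeddings per MRI with report sentences, but it does not give the scoring rule. As a rough illustration only, the sketch below uses a late-interaction (MaxSim-style) score from the retrieval literature: each sentence embedding is matched to its most similar view embedding, and the matches are averaged. The function names, the use of cosine similarity, and the averaging step are assumptions made for this example, not the brat mechanism.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def late_interaction_score(sentence_embeddings, view_embeddings):
    """Hypothetical alignment score between a report and a scan.

    For every sentence embedding, keep the similarity of its
    best-matching view embedding, then average over sentences.
    """
    best_matches = [
        max(cosine(sentence, view) for view in view_embeddings)
        for sentence in sentence_embeddings
    ]
    return sum(best_matches) / len(best_matches)


# Toy data: two views of one scan and two report sentences.
views = [[1.0, 0.0], [0.0, 1.0]]
sentences = [[1.0, 0.0], [0.0, 2.0]]
score = late_interaction_score(sentences, views)
```

With this kind of score, a report whose sentences each find some well-matching view scores highly even if no single embedding covers all findings, which is one motivation for keeping multiple embeddings per scan.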
CV · May 8, 2021 · Code
e-ViL: A Dataset and Benchmark for Natural Language Explanations in Vision-Language Tasks
Maxime Kayser, Oana-Maria Camburu, Leonard Salewski et al.
Recently, there has been an increasing number of efforts to introduce models capable of generating natural language explanations (NLEs) for their predictions on vision-language (VL) tasks. Such models are appealing because they can provide human-friendly and comprehensive explanations. However, there is a lack of comparison between existing methods, which is due to a lack of re-usable evaluation frameworks and a scarcity of datasets. In this work, we introduce e-ViL and e-SNLI-VE. e-ViL is a benchmark for explainable vision-language tasks that establishes a unified evaluation framework and provides the first comprehensive comparison of existing approaches that generate NLEs for VL tasks. It spans four models and three datasets, and both automatic metrics and human evaluation are used to assess model-generated explanations. e-SNLI-VE is currently the largest existing VL dataset with NLEs (over 430k instances). We also propose a new model that combines UNITER, which learns joint embeddings of images and text, and GPT-2, a pre-trained language model that is well-suited for text generation. It surpasses the previous state of the art by a large margin across all datasets. Code and data are available here: https://github.com/maximek3/e-ViL.
HC · Oct 16, 2024
Fool Me Once? Contrasting Textual and Visual Explanations in a Clinical Decision-Support Setting
Maxime Kayser, Bayar Menzat, Cornelius Emde et al.
The growing capabilities of AI models are leading to their wider use, including in safety-critical domains. Explainable AI (XAI) aims to make these models safer to use by making their inference process more transparent. However, current explainability methods are seldom evaluated in the way they are intended to be used: by real-world end users. To address this, we conducted a large-scale user study with 85 healthcare practitioners in the context of human-AI collaborative chest X-ray analysis. We evaluated three types of explanations: visual explanations (saliency maps), natural language explanations, and a combination of both modalities. We specifically examined how different explanation types influence users depending on whether the AI advice and explanations are factually correct. We find that text-based explanations lead to significant over-reliance, which is alleviated by combining them with saliency maps. We also observe that the quality of explanations, that is, how much factually correct information they entail, and how much this aligns with AI correctness, significantly impacts the usefulness of the different explanation types.
CL · Feb 26, 2025
Shh, don't say that! Domain Certification in LLMs
Cornelius Emde, Alasdair Paren, Preetham Arvind et al.
Large language models (LLMs) are often deployed to perform constrained tasks with narrow domains. For example, customer support bots can be built on top of LLMs, relying on their broad language understanding and capabilities to enhance performance. However, these LLMs are susceptible to adversarial inputs, potentially generating outputs outside the intended domain. To formalize, assess, and mitigate this risk, we introduce domain certification: a guarantee that accurately characterizes the out-of-domain behavior of language models. We then propose a simple yet effective approach, which we call VALID, that provides adversarial bounds as a certificate. Finally, we evaluate our method across a diverse set of datasets, demonstrating that it yields meaningful certificates, which bound the probability of out-of-domain samples tightly with minimal penalty to refusal behavior.
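The abstract does not say how the certificate is constructed. One way a bound of this kind can arise, offered here as an assumption rather than a description of VALID, is rejection sampling against an in-domain guide distribution: a response is accepted only if the model's probability for it is at most `ratio_cap` times the guide's probability, with up to `max_tries` attempts before abstaining. Under that rule, the probability of emitting any response `y` is at most `max_tries * ratio_cap * guide_probs[y]`, regardless of the prompt. The toy distributions and all names below are hypothetical.

```python
def certified_rejection_distribution(model_probs, guide_probs, ratio_cap, max_tries):
    """Exact output distribution of a toy rejection-sampling wrapper.

    Each attempt draws y from model_probs and accepts it only if
    model_probs[y] <= ratio_cap * guide_probs[y]. After max_tries
    rejected attempts the wrapper abstains.
    Returns (output_probs, abstain_prob).
    """
    accepted = {
        y: p for y, p in model_probs.items()
        if p <= ratio_cap * guide_probs.get(y, 0.0)
    }
    reject_mass = 1.0 - sum(accepted.values())
    # Probability of reaching attempt t is reject_mass ** t, for t = 0 .. max_tries - 1.
    tries_factor = sum(reject_mass ** t for t in range(max_tries))
    output_probs = {y: p * tries_factor for y, p in accepted.items()}
    abstain_prob = reject_mass ** max_tries
    return output_probs, abstain_prob


# Toy customer-support example: the model was pushed toward an off-topic reply.
model = {"refund policy": 0.6, "order status": 0.3, "off-topic poem": 0.1}
guide = {"refund policy": 0.5, "order status": 0.499, "off-topic poem": 0.001}
outputs, abstain = certified_rejection_distribution(model, guide, ratio_cap=2.0, max_tries=3)

# The bound holds for every response, including ones that were never accepted.
bound_holds = all(
    outputs.get(y, 0.0) <= 3 * 2.0 * guide[y] + 1e-12 for y in guide
)
```

The bound follows because the number of attempts is at most `max_tries` and every accepted response satisfies the ratio test. It is only informative for responses the guide considers unlikely, which is the out-of-domain case of interest.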
LG · Feb 7, 2020
Understanding the effects of artifacts on automated polyp detection and incorporating that knowledge via learning without forgetting
Maxime Kayser, Roger D. Soberanis-Mukul, Anna-Maria Zvereva et al.
Survival rates for colorectal cancer are higher when polyps are detected at an early stage and can be removed before they develop into malignant tumors. Automated polyp detection, which is dominated by deep-learning-based methods, seeks to improve early detection of polyps. However, current efforts rely heavily on the size and quality of the training datasets. The quality of these datasets often suffers from various image artifacts that affect the visibility and, hence, the detection rate. In this work, we conducted a systematic analysis to gain a better understanding of how artifacts affect automated polyp detection. We examined how six different artifact classes, and their location in an image, affect the performance of a RetinaNet-based polyp detection model. We found that, depending on the artifact class, they can either benefit or harm the polyp detector. For instance, bubbles are often misclassified as polyps, while specular reflections inside a polyp region can improve detection capabilities. We then investigated different strategies, such as a learning without forgetting framework, to leverage artifact knowledge to improve automated polyp detection. Our results show that such models can mitigate some of the harmful effects of artifacts, but that more work is needed to significantly improve polyp detection capabilities.
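The abstract names learning without forgetting but does not give the objective used with the RetinaNet-based detector. As a generic sketch of the idea for a simple classification head (not the detection loss from the paper), the loss below adds a distillation term that keeps the current outputs on the previously learned task close to outputs recorded before training on the new task. The function names, the weighting `lam`, and the temperature value are assumptions for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def lwf_loss(new_logits, new_label, old_logits_current, old_logits_recorded,
             lam=1.0, temperature=2.0):
    """Hypothetical learning-without-forgetting loss for one sample.

    new_logits / new_label: current outputs and ground truth for the new task.
    old_logits_current: current outputs of the old-task head.
    old_logits_recorded: old-task outputs recorded before new-task training.
    """
    # Standard cross-entropy on the new task.
    new_task_loss = -math.log(softmax(new_logits)[new_label])
    # Distillation: cross-entropy between softened recorded and current outputs.
    targets = softmax(old_logits_recorded, temperature)
    predictions = softmax(old_logits_current, temperature)
    distillation = -sum(t * math.log(p) for t, p in zip(targets, predictions))
    return new_task_loss + lam * distillation


# The penalty grows when the old-task outputs drift from the recorded ones.
stable = lwf_loss([2.0, 0.5], 0, [1.0, -1.0], [1.0, -1.0])
drifted = lwf_loss([2.0, 0.5], 0, [-1.0, 1.0], [1.0, -1.0])
```

In a setting like the one described, one of the two tasks would be artifact recognition and the other polyp detection; which plays the role of the old task is a design choice this sketch leaves open.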