CYAug 11, 2023
FUTURE-AI: International consensus guideline for trustworthy and deployable artificial intelligence in healthcareKarim Lekadir, Aasa Feragen, Abdul Joseph Fofanah et al. · eth-zurich
Despite major advances in artificial intelligence (AI) for medicine and healthcare, the deployment and adoption of AI technologies remain limited in real-world clinical practice. In recent years, concerns have been raised about the technical, clinical, ethical and legal risks associated with medical AI. To increase real world adoption, it is essential that medical AI tools are trusted and accepted by patients, clinicians, health organisations and authorities. This work describes the FUTURE-AI guideline as the first international consensus framework for guiding the development and deployment of trustworthy AI tools in healthcare. The FUTURE-AI consortium was founded in 2021 and currently comprises 118 inter-disciplinary experts from 51 countries representing all continents, including AI scientists, clinicians, ethicists, and social scientists. Over a two-year period, the consortium defined guiding principles and best practices for trustworthy AI through an iterative process comprising an in-depth literature review, a modified Delphi survey, and online consensus meetings. The FUTURE-AI framework was established based on 6 guiding principles for trustworthy AI in healthcare, i.e. Fairness, Universality, Traceability, Usability, Robustness and Explainability. Through consensus, a set of 28 best practices were defined, addressing technical, clinical, legal and socio-ethical dimensions. The recommendations cover the entire lifecycle of medical AI, from design, development and validation to regulation, deployment, and monitoring. FUTURE-AI is a risk-informed, assumption-free guideline which provides a structured approach for constructing medical AI tools that will be trusted, deployed and adopted in real-world practice. Researchers are encouraged to take the recommendations into account in proof-of-concept stages to facilitate future translation towards clinical practice of medical AI.
LGSep 7, 2023Code
Insights Into the Inner Workings of Transformer Models for Protein Function PredictionMarkus Wenzel, Erik Grüner, Nils Strodthoff
Motivation: We explored how explainable artificial intelligence (XAI) can help to shed light into the inner workings of neural networks for protein function prediction, by extending the widely used XAI method of integrated gradients such that latent representations inside of transformer models, which were finetuned to Gene Ontology term and Enzyme Commission number prediction, can be inspected too. Results: The approach enabled us to identify amino acids in the sequences that the transformers pay particular attention to, and to show that these relevant sequence parts reflect expectations from biology and chemistry, both in the embedding layer and inside of the model, where we identified transformer heads with a statistically significant correspondence of attribution maps with ground truth sequence annotations (e.g. transmembrane regions, active sites) across many proteins. Availability and Implementation: Source code can be accessed at https://github.com/markuswenzel/xai-proteins .
IVApr 11, 2022
From Modern CNNs to Vision Transformers: Assessing the Performance, Robustness, and Classification Strategies of Deep Learning Models in HistopathologyMaximilian Springenberg, Annika Frommholz, Markus Wenzel et al.
While machine learning is currently transforming the field of histopathology, the domain lacks a comprehensive evaluation of state-of-the-art models based on essential but complementary quality requirements beyond a mere classification accuracy. In order to fill this gap, we developed a new methodology to extensively evaluate a wide range of classification models, including recent vision transformers, and convolutional neural networks such as: ConvNeXt, ResNet (BiT), Inception, ViT and Swin transformer, with and without supervised or self-supervised pretraining. We thoroughly tested the models on five widely used histopathology datasets containing whole slide images of breast, gastric, and colorectal cancer and developed a novel approach using an image-to-image translation model to assess the robustness of a cancer classification model against stain variations. Further, we extended existing interpretability methods to previously unstudied models and systematically reveal insights of the models' classifications strategies that can be transferred to future model architectures.
13.3CLApr 16
MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse EventsRaunak Agarwal, Markus Wenzel, Simon Baur et al.
Machine learning in high-stakes domains such as healthcare requires not only strong predictive performance but also reliable uncertainty quantification (UQ) to support human oversight. Multi-label text classification (MLTC) is a central task in this domain, yet remains challenging due to label imbalances, dependencies, and combinatorial complexity. Existing MLTC benchmarks are increasingly saturated and may be affected by training data contamination, making it difficult to distinguish genuine reasoning capabilities from memorization. We introduce MADE, a living MLTC benchmark derived from {m}edical device {ad}verse {e}vent reports and continuously updated with newly published reports to prevent contamination. MADE features a long-tailed distribution of hierarchical labels and enables reproducible evaluation with strict temporal splits. We establish baselines across more than 20 encoder- and decoder-only models under fine-tuning and few-shot settings (instruction-tuned/reasoning variants, local/API-accessible). We systematically assess entropy-/consistency-based and self-verbalized UQ methods. Results show clear trade-offs: smaller discriminatively fine-tuned decoders achieve the strongest head-to-tail accuracy while maintaining competitive UQ; generative fine-tuning delivers the most reliable UQ; large reasoning models improve performance on rare labels yet exhibit surprisingly weak UQ; and self-verbalized confidence is not a reliable proxy for uncertainty. Our work is publicly available at https://hhi.fraunhofer.de/aml-demonstrator/made-benchmark.
CVJul 8, 2022
Generative Adversarial Networks and Other Generative ModelsMarkus Wenzel
Generative networks are fundamentally different in their aim and methods compared to CNNs for classification, segmentation, or object detection. They have initially not been meant to be an image analysis tool, but to produce naturally looking images. The adversarial training paradigm has been proposed to stabilize generative methods, and has proven to be highly successful -- though by no means from the first attempt. This chapter gives a basic introduction into the motivation for Generative Adversarial Networks (GANs) and traces the path of their success by abstracting the basic task and working mechanism, and deriving the difficulty of early practical approaches. Methods for a more stable training will be shown, and also typical signs for poor convergence and their reasons. Though this chapter focuses on GANs that are meant for image generation and image analysis, the adversarial training paradigm itself is not specific to images, and also generalizes to tasks in image analysis. Examples of architectures for image semantic segmentation and abnormality detection will be acclaimed, before contrasting GANs with further generative modeling approaches lately entering the scene. This will allow a contextualized view on the limits but also benefits of GANs.
6.0AIMay 13
Multi-Agent Systems in Emergency Departments: Validation Study on a ED Digital TwinMarkus Wenzel, Tobias Strapatsas, Jessika Kress et al.
Emergency departments (ED) face challenges in patient care and resource management. We propose to explore optimization strategies in a realistic and flexible model and develop a hybrid Discrete Event Simulation (DES) and Agent-Based Model (ABM) simulating highly configurable ED environments. We specifically focus on the validation of the modeling approach. We derive configurations for ED sizes, patient load, and staffing from real-world studies. We then validate the model expressivity by matching its key performance indicators and metrics with their values known from literature. We proceed by implementing scientifically established and practice-proven resource optimization strategies. Comparing the documented real-world outcomes with our model's results demonstrates that the DES-ABM based simulation can effectively replicate real-world ER dynamics under interventions. We lastly integrate a Proof-of-Concept multi-agent system (MAS) that can autonomously explore resource allocation strategies within the simulated ER environment based on a temporal ledger of ED event records. This modular DES-ABM-MAS framework offers a powerful tool to explore resource optimization strategies in emergency departments.
IVDec 27, 2021Code
Infant Brain Age Classification: 2D CNN Outperforms 3D CNN in Small DatasetMahdieh Shabanian, Markus Wenzel, John P. DeVincenzo
Determining if the brain is developing normally is a key component of pediatric neuroradiology and neurology. Brain magnetic resonance imaging (MRI) of infants demonstrates a specific pattern of development beyond simply myelination. While radiologists have used myelination patterns, brain morphology and size characteristics to determine age-adequate brain maturity, this requires years of experience in pediatric neuroradiology. With no standardized criteria, visual estimation of the structural maturity of the brain from MRI before three years of age remains dominated by inter-observer and intra-observer variability. A more objective estimation of brain developmental age could help physicians identify many neurodevelopmental conditions and diseases earlier and more reliably. Such data, however, is naturally hard to obtain, and the observer ground truth not much of a gold standard due to subjectivity of assessment. In this light, we explore the general feasibility to tackle this task, and the utility of different approaches, including two- and three-dimensional convolutional neural networks (CNN) that were trained on a fusion of T1-weighted, T2-weighted, and proton density (PD) weighted sequences from 84 individual subjects divided into four age groups from birth to 3 years of age. In the best performing approach, we achieved an accuracy of 0.90 [95% CI:0.86-0.94] using a 2D CNN on a central axial thick slab. We discuss the comparison to 3D networks and show how the performance compares to the use of only one sequence (T1w). In conclusion, despite the theoretical superiority of 3D CNN approaches, in limited-data situations, such approaches are inferior to simpler architectures. The code can be found in https://github.com/shabanian2018/Age_MRI-Classification
QMApr 16, 2021
Predicting the Binding of SARS-CoV-2 Peptides to the Major Histocompatibility Complex with Recurrent Neural NetworksJohanna Vielhaben, Markus Wenzel, Eva Weicken et al.
Predicting the binding of viral peptides to the major histocompatibility complex with machine learning can potentially extend the computational immunology toolkit for vaccine development, and serve as a key component in the fight against a pandemic. In this work, we adapt and extend USMPep, a recently proposed, conceptually simple prediction algorithm based on recurrent neural networks. Most notably, we combine regressors (binding affinity data) and classifiers (mass spectrometry data) from qualitatively different data sources to obtain a more comprehensive prediction tool. We evaluate the performance on a recently released SARS-CoV-2 dataset with binding stability measurements. USMPep not only sets new benchmarks on selected single alleles, but consistently turns out to be among the best-performing methods or, for some metrics, to be even the overall best-performing method for this task.
QMApr 22, 2020
Risk Estimation of SARS-CoV-2 Transmission from Bluetooth Low Energy MeasurementsFelix Sattler, Jackie Ma, Patrick Wagner et al.
Digital contact tracing approaches based on Bluetooth low energy (BLE) have the potential to efficiently contain and delay outbreaks of infectious diseases such as the ongoing SARS-CoV-2 pandemic. In this work we propose a novel machine learning based approach to reliably detect subjects that have spent enough time in close proximity to be at risk of being infected. Our study is an important proof of concept that will aid the battery of epidemiological policies aiming to slow down the rapid spread of COVID-19.
AISep 13, 2018
Focus Group on Artificial Intelligence for HealthMarcel Salathé, Thomas Wiegand, Markus Wenzel
Artificial Intelligence (AI) - the phenomenon of machines being able to solve problems that require human intelligence - has in the past decade seen an enormous rise of interest due to significant advances in effectiveness and use. The health sector, one of the most important sectors for societies and economies worldwide, is particularly interesting for AI applications, given the ongoing digitalisation of all types of health information. The potential for AI assistance in the health domain is immense, because AI can support medical decision making at reduced costs, everywhere. However, due to the complexity of AI algorithms, it is difficult to distinguish good from bad AI-based solutions and to understand their strengths and weaknesses, which is crucial for clarifying responsibilities and for building trust. For this reason, the International Telecommunication Union (ITU) has established a new Focus Group on "Artificial Intelligence for Health" (FG-AI4H) in partnership with the World Health Organization (WHO). Health and care services are usually the responsibility of a government - even when provided through private insurance systems - and thus under the responsibility of WHO/ITU member states. FG-AI4H will identify opportunities for international standardization, which will foster the application of AI to health issues on a global scale. In particular, it will establish a standardized assessment framework with open benchmarks for the evaluation of AI-based methods for health, such as AI-based diagnosis, triage or treatment decisions.