Simon Baur

CL
h-index12
4papers
3citations
Novelty46%
AI Score44

4 Papers

CLApr 16
MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events

Raunak Agarwal, Markus Wenzel, Simon Baur et al.

Machine learning in high-stakes domains such as healthcare requires not only strong predictive performance but also reliable uncertainty quantification (UQ) to support human oversight. Multi-label text classification (MLTC) is a central task in this domain, yet remains challenging due to label imbalances, dependencies, and combinatorial complexity. Existing MLTC benchmarks are increasingly saturated and may be affected by training data contamination, making it difficult to distinguish genuine reasoning capabilities from memorization. We introduce MADE, a living MLTC benchmark derived from {m}edical device {ad}verse {e}vent reports and continuously updated with newly published reports to prevent contamination. MADE features a long-tailed distribution of hierarchical labels and enables reproducible evaluation with strict temporal splits. We establish baselines across more than 20 encoder- and decoder-only models under fine-tuning and few-shot settings (instruction-tuned/reasoning variants, local/API-accessible). We systematically assess entropy-/consistency-based and self-verbalized UQ methods. Results show clear trade-offs: smaller discriminatively fine-tuned decoders achieve the strongest head-to-tail accuracy while maintaining competitive UQ; generative fine-tuning delivers the most reliable UQ; large reasoning models improve performance on rare labels yet exhibit surprisingly weak UQ; and self-verbalized confidence is not a reliable proxy for uncertainty. Our work is publicly available at https://hhi.fraunhofer.de/aml-demonstrator/made-benchmark.

LGNov 7, 2025
Multimodal Deep Learning for Prediction of Progression-Free Survival in Patients with Neuroendocrine Tumors Undergoing 177Lu-based Peptide Receptor Radionuclide Therapy

Simon Baur, Tristan Ruhwedel, Ekin Böke et al.

Peptide receptor radionuclide therapy (PRRT) is an established treatment for metastatic neuroendocrine tumors (NETs), yet long-term disease control occurs only in a subset of patients. Predicting progression-free survival (PFS) could support individualized treatment planning. This study evaluates laboratory, imaging, and multimodal deep learning models for PFS prediction in PRRT-treated patients. In this retrospective, single-center study 116 patients with metastatic NETs undergoing 177Lu-DOTATOC were included. Clinical characteristics, laboratory values, and pretherapeutic somatostatin receptor positron emission tomography/computed tomographies (SR-PET/CT) were collected. Seven models were trained to classify low- vs. high-PFS groups, including unimodal (laboratory, SR-PET, or CT) and multimodal fusion approaches. Explainability was evaluated by feature importance analysis and gradient maps. Forty-two patients (36%) had short PFS (< 1 year), 74 patients long PFS (>1 year). Groups were similar in most characteristics, except for higher baseline chromogranin A (p = 0.003), elevated gamma-GT (p = 0.002), and fewer PRRT cycles (p < 0.001) in short-PFS patients. The Random Forest model trained only on laboratory biomarkers reached an AUROC of 0.59 +- 0.02. Unimodal three-dimensional convolutional neural networks using SR-PET or CT performed worse (AUROC 0.42 +- 0.03 and 0.54 +- 0.01, respectively). A multimodal fusion model laboratory values, SR-PET, and CT -augmented with a pretrained CT branch - achieved the best results (AUROC 0.72 +- 0.01, AUPRC 0.80 +- 0.01). Multimodal deep learning combining SR-PET, CT, and laboratory biomarkers outperformed unimodal approaches for PFS prediction after PRRT. Upon external validation, such models may support risk-adapted follow-up strategies.

CVAug 6, 2025
On the effectiveness of multimodal privileged knowledge distillation in two vision transformer based diagnostic applications

Simon Baur, Alexandra Benova, Emilio Dolgener Cantú et al.

Deploying deep learning models in clinical practice often requires leveraging multiple data modalities, such as images, text, and structured data, to achieve robust and trustworthy decisions. However, not all modalities are always available at inference time. In this work, we propose multimodal privileged knowledge distillation (MMPKD), a training strategy that utilizes additional modalities available solely during training to guide a unimodal vision model. Specifically, we used a text-based teacher model for chest radiographs (MIMIC-CXR) and a tabular metadata-based teacher model for mammography (CBIS-DDSM) to distill knowledge into a vision transformer student model. We show that MMPKD can improve the resulting attention maps' zero-shot capabilities of localizing ROI in input images, while this effect does not generalize across domains, as contrarily suggested by prior research.

MLAug 6, 2025
Benchmarking Uncertainty and its Disentanglement in multi-label Chest X-Ray Classification

Simon Baur, Wojciech Samek, Jackie Ma

Reliable uncertainty quantification is crucial for trustworthy decision-making and the deployment of AI models in medical imaging. While prior work has explored the ability of neural networks to quantify predictive, epistemic, and aleatoric uncertainties using an information-theoretical approach in synthetic or well defined data settings like natural image classification, its applicability to real life medical diagnosis tasks remains underexplored. In this study, we provide an extensive uncertainty quantification benchmark for multi-label chest X-ray classification using the MIMIC-CXR-JPG dataset. We evaluate 13 uncertainty quantification methods for convolutional (ResNet) and transformer-based (Vision Transformer) architectures across a wide range of tasks. Additionally, we extend Evidential Deep Learning, HetClass NNs, and Deep Deterministic Uncertainty to the multi-label setting. Our analysis provides insights into uncertainty estimation effectiveness and the ability to disentangle epistemic and aleatoric uncertainties, revealing method- and architecture-specific strengths and limitations.