Stefan Kraft

LG
h-index36
4papers
24citations
Novelty29%
AI Score35

4 Papers

LGSep 20, 2024
ALPEC: A Comprehensive Evaluation Framework and Dataset for Machine Learning-Based Arousal Detection in Clinical Practice

Stefan Kraft, Andreas Theissler, Vera Wienhausen-Wilke et al.

Detecting arousals in sleep is essential for diagnosing sleep disorders. However, using Machine Learning (ML) in clinical practice is impeded by fundamental issues, primarily due to mismatches between clinical protocols and ML methods. Clinicians typically annotate only the onset of arousals, while ML methods rely on annotations for both the beginning and end. Additionally, there is no standardized evaluation methodology tailored to clinical needs for arousal detection models. This work addresses these issues by introducing a novel post-processing and evaluation framework emphasizing approximate localization and precise event count (ALPEC) of arousals. We recommend that ML practitioners focus on detecting arousal onsets, aligning with clinical practice. We examine the impact of this shift on current training and evaluation schemes, addressing simplifications and challenges. We utilize a novel comprehensive polysomnographic dataset (CPS) that reflects the aforementioned clinical annotation constraints and includes modalities not present in existing polysomnographic datasets. We release the dataset alongside this paper, demonstrating the benefits of leveraging multimodal data for arousal onset detection. Our findings significantly contribute to integrating ML-based arousal detection in clinical settings, reducing the gap between technological advancements and clinical needs.

CRJun 12, 2024Code
Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition

Edoardo Debenedetti, Javier Rando, Daniel Paleka et al.

Large language model systems face important security risks from maliciously crafted messages that aim to overwrite the system's original instructions or leak private data. To study this problem, we organized a capture-the-flag competition at IEEE SaTML 2024, where the flag is a secret string in the LLM system prompt. The competition was organized in two phases. In the first phase, teams developed defenses to prevent the model from leaking the secret. During the second phase, teams were challenged to extract the secrets hidden for defenses proposed by the other teams. This report summarizes the main insights from the competition. Notably, we found that all defenses were bypassed at least once, highlighting the difficulty of designing a successful defense and the necessity for additional research to protect LLM systems. To foster future research in this direction, we compiled a dataset with over 137k multi-turn attack chats and open-sourced the platform.

LGOct 24, 2025
Assessing the Real-World Utility of Explainable AI for Arousal Diagnostics: An Application-Grounded User Study

Stefan Kraft, Andreas Theissler, Vera Wienhausen-Wilke et al.

Artificial intelligence (AI) systems increasingly match or surpass human experts in biomedical signal interpretation. However, their effective integration into clinical practice requires more than high predictive accuracy. Clinicians must discern \textit{when} and \textit{why} to trust algorithmic recommendations. This work presents an application-grounded user study with eight professional sleep medicine practitioners, who score nocturnal arousal events in polysomnographic data under three conditions: (i) manual scoring, (ii) black-box (BB) AI assistance, and (iii) transparent white-box (WB) AI assistance. Assistance is provided either from the \textit{start} of scoring or as a post-hoc quality-control (\textit{QC}) review. We systematically evaluate how the type and timing of assistance influence event-level and clinically most relevant count-based performance, time requirements, and user experience. When evaluated against the clinical standard used to train the AI, both AI and human-AI teams significantly outperform unaided experts, with collaboration also reducing inter-rater variability. Notably, transparent AI assistance applied as a targeted QC step yields median event-level performance improvements of approximately 30\% over black-box assistance, and QC timing further enhances count-based outcomes. While WB and QC approaches increase the time required for scoring, start-time assistance is faster and preferred by most participants. Participants overwhelmingly favor transparency, with seven out of eight expressing willingness to adopt the system with minor or no modifications. In summary, strategically timed transparent AI assistance effectively balances accuracy and clinical efficiency, providing a promising pathway toward trustworthy AI integration and user acceptance in clinical workflows.

LGOct 17, 2025
Reflections from Research Roundtables at the Conference on Health, Inference, and Learning (CHIL) 2025

Emily Alsentzer, Marie-Laure Charpignon, Bill Chen et al.

The 6th Annual Conference on Health, Inference, and Learning (CHIL 2025), hosted by the Association for Health Learning and Inference (AHLI), was held in person on June 25-27, 2025, at the University of California, Berkeley, in Berkeley, California, USA. As part of this year's program, we hosted Research Roundtables to catalyze collaborative, small-group dialogue around critical, timely topics at the intersection of machine learning and healthcare. Each roundtable was moderated by a team of senior and junior chairs who fostered open exchange, intellectual curiosity, and inclusive engagement. The sessions emphasized rigorous discussion of key challenges, exploration of emerging opportunities, and collective ideation toward actionable directions in the field. In total, eight roundtables were held by 19 roundtable chairs on topics of "Explainability, Interpretability, and Transparency," "Uncertainty, Bias, and Fairness," "Causality," "Domain Adaptation," "Foundation Models," "Learning from Small Medical Data," "Multimodal Methods," and "Scalable, Translational Healthcare Solutions."