LG, AI, HC | Oct 24, 2025

Assessing the Real-World Utility of Explainable AI for Arousal Diagnostics: An Application-Grounded User Study

Stefan Kraft, Andreas Theissler, Vera Wienhausen-Wilke, Gjergji Kasneci, Hendrik Lensch

arXiv:2510.21389v1 | h-index: 36

Originality: Incremental advance

AI Analysis

This work addresses the challenge of trustworthy AI integration for clinicians in sleep medicine. It is an incremental advance, since it applies existing explainable AI methods to a specific application.

The study examines the integration of AI into clinical practice by evaluating how transparent (white-box) versus black-box AI assistance affects sleep medicine practitioners' performance in scoring nocturnal arousal events. Transparent AI assistance applied as a quality-control step improved median event-level performance by approximately 30% over black-box assistance. Human-AI collaboration also reduced inter-rater variability, and most participants favored transparency.

Artificial intelligence (AI) systems increasingly match or surpass human experts in biomedical signal interpretation. However, their effective integration into clinical practice requires more than high predictive accuracy. Clinicians must discern when and why to trust algorithmic recommendations. This work presents an application-grounded user study with eight professional sleep medicine practitioners, who score nocturnal arousal events in polysomnographic data under three conditions: (i) manual scoring, (ii) black-box (BB) AI assistance, and (iii) transparent white-box (WB) AI assistance. Assistance is provided either from the start of scoring or as a post-hoc quality-control (QC) review. We systematically evaluate how the type and timing of assistance influence event-level and clinically most relevant count-based performance, time requirements, and user experience. When evaluated against the clinical standard used to train the AI, both AI and human-AI teams significantly outperform unaided experts, with collaboration also reducing inter-rater variability. Notably, transparent AI assistance applied as a targeted QC step yields median event-level performance improvements of approximately 30% over black-box assistance, and QC timing further enhances count-based outcomes. While WB and QC approaches increase the time required for scoring, start-time assistance is faster and preferred by most participants. Participants overwhelmingly favor transparency, with seven out of eight expressing willingness to adopt the system with minor or no modifications. In summary, strategically timed transparent AI assistance effectively balances accuracy and clinical efficiency, providing a promising pathway toward trustworthy AI integration and user acceptance in clinical workflows.
