SP AI LG PF | May 19, 2025

SzCORE as a benchmark: report from the seizure detection challenge at the 2025 AI in Epilepsy and Neurological Disorders Conference

Jonathan Dan, Amirhossein Shahbazinia, Christodoulos Kechris, David Atienza

arXiv:2505.18191v1 | 5 citations | h-index: 4

Originality: Synthesis-oriented

AI Analysis

This work addresses the unreliability of automatic seizure detection algorithms by giving clinicians and researchers a standardized benchmarking framework. The contribution is incremental, since it builds on existing challenge formats.

The authors organized a seizure detection challenge using a private EEG dataset of 65 subjects (4,360 hours) to assess algorithm performance, with the top submission achieving an F1-score of 43% (sensitivity 37%, precision 45%), highlighting the ongoing difficulty of this task.

Reliable automatic seizure detection from long-term EEG remains a challenge, as current machine learning models often fail to generalize across patients or clinical settings. Manual EEG review remains the clinical standard, underscoring the need for robust models and standardized evaluation. To rigorously assess algorithm performance, we organized a challenge using a private dataset of continuous EEG recordings from 65 subjects (4,360 hours). Expert neurophysiologists annotated the data, providing ground truth for seizure events. Participants were required to detect seizure onset and duration, with evaluation based on event-based metrics, including sensitivity, precision, F1-score, and false positives per day. The SzCORE framework ensured standardized evaluation. The primary ranking criterion was the event-based F1-score, reflecting clinical relevance by balancing sensitivity and false positives. The challenge received 30 submissions from 19 teams, with 28 algorithms evaluated. Results revealed wide variability in performance, with a top F1-score of 43% (sensitivity 37%, precision 45%), highlighting the ongoing difficulty of seizure detection. The challenge also revealed a gap between reported performance and real-world evaluation, emphasizing the importance of rigorous benchmarking. Compared to previous challenges and commercial systems, the best-performing algorithm in this contest showed improved performance. Importantly, the challenge platform now supports continuous benchmarking, enabling reproducible research, integration of new datasets, and clinical evaluation of seizure detection algorithms using a standardized framework.
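
The event-based metrics named in the abstract (sensitivity, precision, F1-score, false positives per day) can be illustrated with a minimal Python sketch. It is not the SzCORE implementation. It assumes a simplified matching rule in which a detection counts as correct if it overlaps a reference seizure by any amount, and it ignores any tolerance windows or event-merging rules the framework may apply. The names `Event`, `overlaps`, and `event_scores` are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A seizure event; times are in seconds from the start of the recording."""
    start: float
    end: float


def overlaps(a: Event, b: Event) -> bool:
    """True if the two events share any time (assumed any-overlap rule)."""
    return a.start < b.end and b.start < a.end


def event_scores(reference, hypothesis, duration_s):
    """Event-based scores for one recording (simplified, illustrative)."""
    # Reference seizures hit by at least one detection.
    detected = sum(any(overlaps(r, h) for h in hypothesis) for r in reference)
    # Detections that overlap at least one reference seizure.
    matched = sum(any(overlaps(h, r) for r in reference) for h in hypothesis)
    false_positives = len(hypothesis) - matched

    # Scores are NaN when undefined (no reference or no detections).
    sensitivity = detected / len(reference) if reference else float("nan")
    precision = matched / len(hypothesis) if hypothesis else float("nan")

    # F1 is reported as 0.0 when both scores are zero or either is undefined.
    total = sensitivity + precision
    f1 = 2 * sensitivity * precision / total if total > 0 else 0.0

    # Normalize the false-positive count to a 24-hour period.
    fp_per_day = false_positives / (duration_s / 86400.0)

    return {
        "sensitivity": sensitivity,
        "precision": precision,
        "f1": f1,
        "fp_per_day": fp_per_day,
    }


if __name__ == "__main__":
    # Made-up annotations for a 12-hour recording, not challenge data.
    reference = [Event(100, 160), Event(1000, 1050)]
    hypothesis = [Event(110, 150), Event(500, 520), Event(2000, 2010)]
    print(event_scores(reference, hypothesis, duration_s=12 * 3600))
```

Counting whole events rather than individual samples means a long seizure and a short one weigh the same, which is closer to how a reviewer reads an EEG. The any-overlap rule is the simplest possible choice; a stricter rule would lower both sensitivity and precision.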
