LGJun 10, 2025

SLEEPYLAND: trust begins with fair evaluation of automatic sleep staging models

arXiv:2506.08574v23 citationsh-index: 24Has Codenpj Digital Medicine
Originality Incremental advance
AI Analysis

This work addresses barriers to clinical adoption of sleep staging models by providing a standardized evaluation framework and ensemble method, though it is incremental in improving existing architectures and evaluation practices.

The paper tackles the problem of fair evaluation and generalization in automatic sleep staging by introducing SLEEPYLAND, an open-source framework with extensive datasets and pre-trained models, and SOMNUS, an ensemble model that achieves robust performance across diverse datasets, outperforming individual models and previous state-of-the-art methods, with macro-F1 scores ranging from 68.7% to 87.2% and surpassing human scorers in some cases.

Despite advances in deep learning for automatic sleep staging, clinical adoption remains limited due to challenges in fair model evaluation, generalization across diverse datasets, model bias, and variability in human annotations. We present SLEEPYLAND, an open-source sleep staging evaluation framework designed to address these barriers. It includes more than 220'000 hours in-domain (ID) sleep recordings, and more than 84'000 hours out-of-domain (OOD) sleep recordings, spanning a broad range of ages, sleep-wake disorders, and hardware setups. We release pre-trained models based on high-performing SoA architectures and evaluate them under standardized conditions across single- and multi-channel EEG/EOG configurations. We introduce SOMNUS, an ensemble combining models across architectures and channel setups via soft voting. SOMNUS achieves robust performance across twenty-four different datasets, with macro-F1 scores between 68.7% and 87.2%, outperforming individual models in 94.9% of cases. Notably, SOMNUS surpasses previous SoA methods, even including cases where compared models were trained ID while SOMNUS treated the same data as OOD. Using a subset of the BSWR (N=6'633), we quantify model biases linked to age, gender, AHI, and PLMI, showing that while ensemble improves robustness, no model architecture consistently minimizes bias in performance and clinical markers estimation. In evaluations on OOD multi-annotated datasets (DOD-H, DOD-O), SOMNUS exceeds the best human scorer, i.e., MF1 85.2% vs 80.8% on DOD-H, and 80.2% vs 75.9% on DOD-O, better reproducing the scorer consensus than any individual expert (k = 0.89/0.85 and ACS = 0.95/0.94 for healthy/OSA cohorts). Finally, we introduce ensemble disagreement metrics - entropy and inter-model divergence based - predicting regions of scorer disagreement with ROC AUCs up to 0.828, offering a data-driven proxy for human uncertainty.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes