LGAug 6, 2025
Honest and Reliable Evaluation and Expert Equivalence Testing of Automated Neonatal Seizure DetectionJovana Kljajic, John M. O'Toole, Robert Hogan et al.
Reliable evaluation of machine learning models for neonatal seizure detection is critical for clinical adoption. Current practices often rely on inconsistent and biased metrics, hindering model comparability and interpretability. Expert-level claims about AI performance are frequently made without rigorous validation, raising concerns about their reliability. This study aims to systematically evaluate common performance metrics and propose best practices tailored to the specific challenges of neonatal seizure detection. Using real and synthetic seizure annotations, we assessed standard performance metrics, consensus strategies, and human-expert level equivalence tests under varying class imbalance, inter-rater agreement, and number of raters. Matthews and Pearson's correlation coefficients outperformed the area under the curve in reflecting performance under class imbalance. Consensus types are sensitive to the number of raters and agreement level among them. Among human-expert level equivalence tests, the multi-rater Turing test using Fleiss k best captured expert-level AI performance. We recommend reporting: (1) at least one balanced metric, (2) Sensitivity, specificity, PPV and NPV, (3) Multi-rater Turing test results using Fleiss k, and (4) All the above on held-out validation set. This proposed framework provides an important prerequisite to clinical validation by enabling a thorough and honest appraisal of AI methods for neonatal seizure detection.
SPAug 27, 2025
Machine-learning competition to grade EEG background patterns in newborns with hypoxic-ischaemic encephalopathyFabio Magarelli, Geraldine B. Boylan, Saeed Montazeri et al.
Machine learning (ML) has the potential to support and improve expert performance in monitoring the brain function of at-risk newborns. Developing accurate and reliable ML models depends on access to high-quality, annotated data, a resource in short supply. ML competitions address this need by providing researchers access to expertly annotated datasets, fostering shared learning through direct model comparisons, and leveraging the benefits of crowdsourcing diverse expertise. We compiled a retrospective dataset containing 353 hours of EEG from 102 individual newborns from a multi-centre study. The data was fully anonymised and divided into training, testing, and held-out validation datasets. EEGs were graded for the severity of abnormal background patterns. Next, we created a web-based competition platform and hosted a machine learning competition to develop ML models for classifying the severity of EEG background patterns in newborns. After the competition closed, the top 4 performing models were evaluated offline on a separate held-out validation dataset. Although a feature-based model ranked first on the testing dataset, deep learning models generalised better on the validation sets. All methods had a significant decline in validation performance compared to the testing performance. This highlights the challenges for model generalisation on unseen data, emphasising the need for held-out validation datasets in ML studies with neonatal EEG. The study underscores the importance of training ML models on large and diverse datasets to ensure robust generalisation. The competition's outcome demonstrates the potential for open-access data and collaborative ML development to foster a collaborative research environment and expedite the development of clinical decision-support tools for neonatal neuromonitoring.