LG Dec 1, 2025
CLEF: Clinically-Guided Contrastive Learning for Electrocardiogram Foundation Models
Yuxuan Shu, Peter H. Charlton, Fahim Kawsar et al.
The electrocardiogram (ECG) is a key diagnostic tool in cardiovascular health. Single-lead ECG recording is integrated into both clinical-grade and consumer wearables. While self-supervised pretraining of foundation models on unlabeled ECGs improves diagnostic performance, existing approaches do not incorporate domain knowledge from clinical metadata. We introduce a novel contrastive learning approach that utilizes an established clinical risk score to adaptively weight negative pairs: clinically-guided contrastive learning. It aligns the similarities of ECG embeddings with clinically meaningful differences between subjects, with an explicit mechanism to handle missing metadata. On 12-lead ECGs from 161K patients in the MIMIC-IV dataset, we pretrain single-lead ECG foundation models at three scales, collectively called CLEF, using only routinely collected metadata without requiring per-sample ECG annotations. We evaluate CLEF on 18 clinical classification and regression tasks across 7 held-out datasets, and benchmark against 5 foundation model baselines and 3 self-supervised algorithms. When pretrained on 12-lead ECG data and tested on lead-I data, CLEF outperforms self-supervised foundation model baselines: the medium-sized CLEF achieves average AUROC improvements of at least 2.6% in classification and average reductions in MAEs of at least 3.2% in regression. Comparing with existing self-supervised learning algorithms, CLEF improves the average AUROC by at least 1.8%. Moreover, when pretrained only on lead-I data for classification tasks, CLEF performs comparably to the state-of-the-art ECGFounder, which was trained in a supervised manner. Overall, CLEF enables more accurate and scalable single-lead ECG analysis, advancing remote health monitoring. Code and pretrained CLEF models are available at: github.com/Nokia-Bell-Labs/ecg-foundation-model.
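The idea of weighting negative pairs by a clinical risk score can be illustrated with a small sketch. This is not the CLEF implementation: the weighting rule (`1 + |risk difference|`), the neutral fallback weight for missing metadata, and every name below are assumptions made for illustration, and risk scores are assumed to be normalised to [0, 1].

```python
import math

def risk_weighted_contrastive_loss(sims, pos_index, anchor_risk, risks,
                                   temperature=0.1, default_weight=1.0):
    """InfoNCE-style loss for one anchor with risk-weighted negatives.

    sims        -- similarity between the anchor and each candidate
    pos_index   -- index of the positive candidate in `sims`
    anchor_risk -- risk score of the anchor subject, or None if missing
    risks       -- risk score of each candidate, None where missing
    """
    pos_term = math.exp(sims[pos_index] / temperature)
    neg_sum = 0.0
    for j, (sim, risk) in enumerate(zip(sims, risks)):
        if j == pos_index:
            continue
        if anchor_risk is None or risk is None:
            # Assumed fallback: missing metadata gives a neutral weight,
            # which reduces to the unweighted contrastive term.
            weight = default_weight
        else:
            # Assumed rule: clinically dissimilar subjects are pushed
            # apart more strongly than clinically similar ones.
            weight = 1.0 + abs(anchor_risk - risk)
        neg_sum += weight * math.exp(sim / temperature)
    return -math.log(pos_term / (pos_term + neg_sum))
```

With all metadata missing, the function falls back to a standard unweighted contrastive loss; when risk scores are present, negatives from subjects with very different risk contribute more to the denominator.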
LG Apr 1
Benchmark Problems and Benchmark Datasets for the evaluation of Machine and Deep Learning methods on Photoplethysmography signals: the D4 report from the QUMPHY project
Urs Hackstein, Jordi Alastruey, Philip Aston et al.
This report is part of the Qumphy project (22HLT01 Qumphy), which is funded by the European Union and is dedicated to developing measures that quantify the uncertainties associated with Machine Learning algorithms applied to medical problems, in particular the analysis and processing of Photoplethysmography (PPG) signals. The report lists six medical problems related to PPG signals that serve as Benchmark Problems. Suitable Benchmark datasets and their usage are also described.
LG Oct 31, 2025
A systematic evaluation of uncertainty quantification techniques in deep learning: a case study in photoplethysmography signal analysis
Ciaran Bench, Oskar Pfeffer, Vivek Desai et al.
In principle, deep learning models trained on medical time-series, including wearable photoplethysmography (PPG) sensor data, can provide a means to continuously monitor physiological parameters outside of clinical settings. However, there is considerable risk of poor performance when deployed in practical measurement scenarios, leading to negative patient outcomes. Reliable uncertainties accompanying predictions can provide guidance to clinicians in their interpretation of the trustworthiness of model outputs. It is therefore of interest to compare the effectiveness of different approaches. Here we apply an unprecedented set of eight uncertainty quantification (UQ) techniques to models trained on two clinically relevant prediction tasks: Atrial Fibrillation (AF) detection (classification), and two variants of blood pressure regression. We formulate a comprehensive evaluation procedure to enable a rigorous comparison of these approaches. We observe a complex picture of uncertainty reliability across the different techniques, where the best technique for a given task depends on the chosen expression of uncertainty, evaluation metric, and scale of reliability assessed. We find that assessing local calibration and adaptivity provides practically relevant insights about model behaviour that otherwise cannot be acquired using more commonly implemented global reliability metrics. We emphasise that criteria for evaluating UQ techniques should cater to the model's practical use case, where the use of a small number of measurements per patient places a premium on achieving small-scale reliability for the chosen expression of uncertainty, while preserving as much predictive performance as possible.
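As an illustration of what a global reliability metric looks like, the sketch below computes the expected calibration error (ECE) of a classifier with equal-width confidence bins. The abstract does not name the metrics used in the study, so choosing ECE here is an assumption, as are the function name and the default of ten bins.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy per bin.

    confidences -- predicted probability of the chosen class, in [0, 1]
    correct     -- booleans, True where the prediction was right
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Each bin contributes in proportion to how many samples it holds.
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece
```

Because the per-bin gaps are averaged over the whole test set, a single ECE value can hide regions where the model is badly calibrated, which is the kind of limitation that motivates looking at local calibration as well.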
LG Mar 23
Deriving Health Metrics from the Photoplethysmogram: Benchmarks and Insights from MIMIC-III-Ext-PPG
Mohammad Moulaeifard, Philip J. Aston, Peter H. Charlton et al.
Photoplethysmography (PPG) is one of the most widely captured biosignals for clinical prediction tasks, yet PPG-based algorithms are typically trained on small-scale datasets of uncertain quality, which hinders meaningful algorithm comparisons. We present a comprehensive benchmark for PPG-based clinical prediction using the MIMIC-III-Ext-PPG dataset, establishing baselines across the full spectrum of clinically relevant applications: multi-class heart rhythm classification, and regression of physiological parameters including respiratory rate (RR), heart rate (HR), and blood pressure (BP). Most notably, we provide the first comprehensive assessment of PPG for general arrhythmia detection beyond atrial fibrillation (AF) and atrial flutter (AFLT), with performance stratified by BP, HR, and demographic subgroups. Using established deep learning architectures, we achieved strong performance for AF detection (AUROC = 0.96) and accurate physiological parameter estimation (RR MAE: 2.97 bpm; HR MAE: 1.13 bpm; SBP/DBP MAE: 16.13/8.70 mmHg). Cross-dataset validation demonstrates excellent generalizability for AF detection (AUROC = 0.97), while clinical subgroup analysis reveals marked performance differences across subgroups by BP, HR, and demographic strata. These variations appear to reflect population-specific waveform differences rather than systematic bias in model behavior. This framework establishes the first integrated benchmark for multi-task PPG-based clinical prediction, demonstrating that PPG signals can effectively support multiple simultaneous monitoring tasks and providing essential baselines for future algorithm development.
LG Feb 26, 2025
Generalizable deep learning for photoplethysmography-based blood pressure estimation -- A Benchmarking Study
Mohammad Moulaeifard, Peter H. Charlton, Nils Strodthoff
Photoplethysmography (PPG)-based blood pressure (BP) estimation represents a promising alternative to cuff-based BP measurements. Recently, an increasing number of deep learning models have been proposed to infer BP from the raw PPG waveform. However, these models have been predominantly evaluated on in-distribution test sets, which immediately raises the question of the generalizability of these models to external datasets. To investigate this question, we trained five deep learning models on the recently released PulseDB dataset, provided in-distribution benchmarking results on this dataset, and then assessed out-of-distribution performance on several external datasets. The best model (XResNet1d101) achieved in-distribution MAEs of 9.4 and 6.0 mmHg for systolic and diastolic BP respectively on PulseDB (with subject-specific calibration), and 14.0 and 8.5 mmHg respectively without calibration. Equivalent MAEs on external test datasets without calibration ranged from 15.0 to 25.1 mmHg (SBP) and 7.0 to 10.4 mmHg (DBP). Our results indicate that the performance is strongly influenced by the differences in BP distributions between datasets. We investigated a simple way of improving performance through sample-based domain adaptation and put forward recommendations for training models with good generalization properties. With this work, we hope to educate more researchers about the importance and challenges of out-of-distribution generalization.
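One common form of sample-based domain adaptation is importance weighting, where training samples are reweighted so that their label distribution resembles that of the target dataset. The abstract does not describe the procedure actually used, so the sketch below is only an assumed illustration: the histogram-ratio weights, the 10 mmHg bin width, and all function names are hypothetical.

```python
def importance_weights(source_bp, target_bp, bin_width=10.0):
    """Weight each source sample by p_target(bin) / p_source(bin)."""
    def histogram(values):
        counts = {}
        for value in values:
            key = int(value // bin_width)
            counts[key] = counts.get(key, 0) + 1
        # Convert raw counts to relative frequencies.
        return {key: n / len(values) for key, n in counts.items()}

    p_source = histogram(source_bp)
    p_target = histogram(target_bp)
    weights = []
    for value in source_bp:
        key = int(value // bin_width)
        # Bins that never occur in the target get a weight of zero.
        weights.append(p_target.get(key, 0.0) / p_source[key])
    return weights

def weighted_mae(y_true, y_pred, weights):
    """Mean absolute error with per-sample weights."""
    total = sum(w * abs(t - p) for t, p, w in zip(y_true, y_pred, weights))
    return total / sum(weights)
```

For example, if low systolic BP values are rare in the training data but common in the target data, those training samples receive weights above 1, so errors on them count more during training or evaluation.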
AI Apr 4, 2025
Towards deployment-centric multimodal AI beyond vision and language
Xianyuan Liu, Jiayang Zhang, Shuo Zhou et al.
Multimodal artificial intelligence (AI) integrates diverse types of data via machine learning to improve understanding, prediction, and decision-making across disciplines such as healthcare, science, and engineering. However, most multimodal AI advances focus on models for vision and language data, while their deployability remains a key challenge. We advocate a deployment-centric workflow that incorporates deployment constraints early to reduce the likelihood of undeployable solutions, complementing data-centric and model-centric approaches. We also emphasise deeper integration across multiple levels of multimodality and multidisciplinary collaboration to significantly broaden the research scope beyond vision and language. To facilitate this approach, we identify common multimodal-AI-specific challenges shared across disciplines and examine three real-world use cases: pandemic response, self-driving car design, and climate change adaptation, drawing expertise from healthcare, social science, engineering, science, sustainability, and finance. By fostering multidisciplinary dialogue and open research practices, our community can accelerate deployment-centric development for broad societal impact.
LG Feb 27, 2025
Machine-learning for photoplethysmography analysis: Benchmarking feature, image, and signal-based approaches
Mohammad Moulaeifard, Loic Coquelin, Mantas Rinkevičius et al.
Photoplethysmography (PPG) is a widely used non-invasive physiological sensing technique, suitable for various clinical applications. Such clinical applications are increasingly supported by machine learning methods, raising the question of the most appropriate input representation and model choice. Comprehensive comparisons, in particular across different input representations, are scarce. We address this gap in the research landscape with a comprehensive benchmarking study covering three kinds of input representations (interpretable features, image representations, and raw waveforms) across prototypical regression and classification use cases: blood pressure and atrial fibrillation prediction. In both cases, the best results are achieved by deep neural networks operating on raw time series as input representations. Within this model class, the best results are achieved by modern convolutional neural networks (CNNs), but depending on the task setup, shallow CNNs are often also very competitive. We envision that these results will be insightful for researchers to guide their choice on machine learning tasks for PPG data, even beyond the use cases presented in this work.
LG Feb 11, 2022
SleepPPG-Net: a deep learning algorithm for robust sleep staging from continuous photoplethysmography
Kevin Kotzen, Peter H. Charlton, Sharon Salabi et al.
Introduction: Sleep staging is an essential component in the diagnosis of sleep disorders and management of sleep health. It is traditionally measured in a clinical setting and requires a labor-intensive labeling process. We hypothesize that it is possible to perform robust 4-class sleep staging using the raw photoplethysmography (PPG) time series and modern advances in deep learning (DL). Methods: We used two publicly available sleep databases that included raw PPG recordings, totalling 2,374 patients and 23,055 hours. We developed SleepPPG-Net, a DL model for 4-class sleep staging from the raw PPG time series. SleepPPG-Net was trained end-to-end and consists of a residual convolutional network for automatic feature extraction and a temporal convolutional network to capture long-range contextual information. We benchmarked the performance of SleepPPG-Net against models based on the best-reported state-of-the-art (SOTA) algorithms. Results: When benchmarked on a held-out test set, SleepPPG-Net obtained a median Cohen's Kappa ($κ$) score of 0.75 against 0.69 for the best SOTA approach. SleepPPG-Net showed good generalization performance to an external database, obtaining a $κ$ score of 0.74 after transfer learning. Perspective: Overall, SleepPPG-Net provides new SOTA performance. In addition, performance is high enough to open the path to the development of wearables that meet the requirements for usage in clinical applications such as the diagnosis and monitoring of obstructive sleep apnea.
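The Cohen's Kappa score reported above measures agreement between predicted and reference sleep stages after correcting for agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below shows that standard formula; the function name and the stage labels in the usage note are hypothetical, since the abstract only states that staging is 4-class.

```python
def cohens_kappa(reference, predicted):
    """Chance-corrected agreement between two label sequences."""
    n = len(reference)
    classes = set(reference) | set(predicted)
    # Observed agreement: fraction of epochs with identical labels.
    p_observed = sum(r == p for r, p in zip(reference, predicted)) / n
    # Expected agreement if both label sequences were independent.
    p_expected = sum(
        (reference.count(c) / n) * (predicted.count(c) / n) for c in classes
    )
    if p_expected == 1.0:
        # Both sequences use one identical class; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)
```

Calling `cohens_kappa(["wake", "light", "deep", "rem"], ["wake", "light", "deep", "rem"])` gives perfect agreement, while two label sequences that agree only as often as chance predicts score 0.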