ML · LG · Sep 10, 2025

PEHRT: A Common Pipeline for Harmonizing Electronic Health Record data for Translational Research

Jessica Gronsbell, Vidul Ayakulangara Panickan, Chris Lin, Thomas Charlon, Chuan Hong, Doudou Zhou, Linshanshan Wang, Jianhui Gao, Shirley Zhou, Yuan Tian, Yaqi Shi, Ziming Gan

arXiv:2509.08553v1 · 4.5 · h-index: 13 · Has Code

Originality: Synthesis-oriented

AI Analysis

This paper addresses data integration problems faced by translational researchers working with multi-institutional EHR data, though it appears incremental because it builds on existing harmonization methods.

The paper tackles the challenge of harmonizing Electronic Health Record (EHR) data across institutions, a task complicated by data heterogeneity and privacy concerns. It introduces PEHRT, a standardized pipeline that maps data to standard coding systems and uses machine learning to generate research-ready datasets without sharing individual-level data.

Integrative analysis of multi-institutional Electronic Health Record (EHR) data enhances the reliability and generalizability of translational research by leveraging larger, more diverse patient cohorts and incorporating multiple data modalities. However, harmonizing EHR data across institutions poses major challenges due to data heterogeneity, semantic differences, and privacy concerns. To address these challenges, we introduce $\textit{PEHRT}$, a standardized pipeline for efficient EHR data harmonization consisting of two core modules: (1) data pre-processing and (2) representation learning. PEHRT maps EHR data to standard coding systems and uses advanced machine learning to generate research-ready datasets without requiring individual-level data sharing. Our pipeline is also data model agnostic and designed for streamlined execution across institutions based on our extensive real-world experience. We provide a complete suite of open source software, accompanied by a user-friendly tutorial, and demonstrate the utility of PEHRT in a variety of tasks using data from diverse healthcare systems.
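As a rough illustration of the two ideas in the abstract (mapping local codes to a standard vocabulary, then sharing only aggregates), the following minimal Python sketch may help. It is not PEHRT's actual implementation: the code map, the code names, the site data, and the choice of pairwise co-occurrence counts as the shared summary statistic are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical local-to-standard code map. All code names here are invented
# for illustration; a real pipeline would use curated vocabularies.
LOCAL_TO_STANDARD = {
    "SITE_A:DM2": "STD:T2D",
    "SITE_B:DIAB_II": "STD:T2D",
    "SITE_A:GLU": "STD:GLUCOSE",
    "SITE_B:GLUC": "STD:GLUCOSE",
    "SITE_A:HTN": "STD:HTN",
    "SITE_B:HYPERT": "STD:HTN",
}


def harmonize(records):
    """Map each patient's local codes to standard codes, dropping unmapped ones."""
    return {
        patient: {LOCAL_TO_STANDARD[c] for c in codes if c in LOCAL_TO_STANDARD}
        for patient, codes in records.items()
    }


def cooccurrence_counts(harmonized):
    """For each pair of standard codes, count the patients who have both."""
    counts = Counter()
    for codes in harmonized.values():
        for pair in combinations(sorted(codes), 2):
            counts[pair] += 1
    return counts


# Toy patient-level data that never leaves each (hypothetical) site.
site_a = {"p1": ["SITE_A:DM2", "SITE_A:GLU"], "p2": ["SITE_A:DM2", "SITE_A:HTN"]}
site_b = {"q1": ["SITE_B:DIAB_II", "SITE_B:GLUC"], "q2": ["SITE_B:HYPERT", "SITE_B:OTHER"]}

# Each site computes counts locally; only these aggregates are pooled.
pooled = cooccurrence_counts(harmonize(site_a)) + cooccurrence_counts(harmonize(site_b))

for pair, n in sorted(pooled.items()):
    print(pair, n)
```

In this sketch the pooled counts could feed a downstream representation-learning step, while the per-patient records stay at their originating site. Which summary statistics and models PEHRT actually uses is described in the paper and its tutorial, not here.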
