Mind the Missing: Variable-Aware Representation Learning for Irregular EHR Time Series using Large Language Models
This work addresses a critical challenge in healthcare data analysis for clinicians and researchers, though the contribution is incremental because it builds on existing LLM-based approaches.
The paper tackles irregular sampling and high missingness in electronic health record (EHR) time series by proposing VITAL, a variable-aware framework built on large language models. VITAL outperforms state-of-the-art methods on benchmark datasets and maintains robust performance under high missingness.
Irregular sampling and high missingness are intrinsic challenges in modeling time series derived from electronic health records (EHRs), where clinical variables are measured at uneven intervals depending on workflow and intervention timing. To address this, we propose VITAL, a variable-aware, large language model (LLM)-based framework tailored for learning from irregularly sampled physiological time series. VITAL differentiates between two distinct types of clinical variables: vital signs, which are frequently recorded and exhibit temporal patterns, and laboratory tests, which are measured sporadically and lack temporal structure. It reprograms vital signs into the language space, enabling the LLM to capture temporal context and reason over missing values through explicit encoding. In contrast, laboratory variables are embedded using either representative summary values or a learnable [Not measured] token, depending on their availability. Extensive evaluations on benchmark datasets from PhysioNet demonstrate that VITAL outperforms state-of-the-art methods designed for irregular time series. Furthermore, it maintains robust performance under high levels of missingness, which is prevalent in real-world clinical scenarios where key variables are often unavailable.
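The variable-aware split described above can be illustrated with a minimal sketch. This is not the VITAL implementation. The function names, the `[MISSING]` marker, the use of the mean as the "representative summary value", and the string stand-in for the learnable [Not measured] token are all assumptions made for illustration. The abstract does not specify these details.

```python
import math
from statistics import mean

# Stand-in for the learnable token; in a real model this would be a trained embedding.
NOT_MEASURED = "[Not measured]"
# Hypothetical marker for a missing vital-sign reading (not named in the source).
MISSING = "[MISSING]"


def is_missing(value):
    """Treat None and NaN as unobserved."""
    return value is None or math.isnan(value)


def encode_vital(series):
    """Keep temporal order and mark every missing step explicitly."""
    return [MISSING if is_missing(v) else f"{v:.1f}" for v in series]


def encode_lab(values):
    """Summarize observed lab values, or fall back to the not-measured token.

    The mean is an assumed choice of summary statistic.
    """
    observed = [v for v in values if not is_missing(v)]
    return f"{mean(observed):.2f}" if observed else NOT_MEASURED


def encode_patient(vitals, labs):
    """Route each variable to the encoder that matches its type."""
    return {
        "vitals": {name: encode_vital(s) for name, s in vitals.items()},
        "labs": {name: encode_lab(v) for name, v in labs.items()},
    }


nan = float("nan")
example = encode_patient(
    vitals={"heart_rate": [82.0, nan, 90.0]},
    labs={"lactate": [nan, 1.5, 2.5], "troponin": [nan, nan, nan]},
)
print(example)
```

In this sketch the heart-rate series keeps its time steps with the gap marked in place, the lactate values collapse to a single summary, and the never-measured troponin maps to the placeholder token, mirroring the two treatments the abstract describes.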