Training Large Language Models to Predict Clinical Events

Benjamin Turtel, Paul Wilczewski, Kris Skotheim

arXiv:2605.1281744.2

AI Analysis

For clinical NLP researchers, this provides a method to generate reusable prediction supervision from longitudinal notes without hand-engineered features, though the dataset is small (6,900 examples from 702 admissions) and the gains are moderate.

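The authors extend Foresight Learning to clinical prediction by converting time-ordered MIMIC-III notes into 6,900 prediction examples across five event types: medications, procedures, organ support, microbiology, and mortality. A small LoRA adapter trained on these examples reduces expected calibration error from 0.1269 to 0.0398 and Brier score from 0.199 to 0.145 relative to the prompted base model, slightly outperforming GPT-5 point estimates on held-out questions.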
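
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how Brier score and expected calibration error (ECE) are commonly computed for binary event forecasts. The function names, the choice of 10 equal-width bins, and the toy inputs are assumptions for illustration; the summary does not state the binning scheme the authors used.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted average gap between mean predicted probability and observed
    event rate, over equal-width probability bins (assumed scheme)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Clamp so that p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        event_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_prob - event_rate)
    return ece


# Toy forecasts (synthetic, not from the paper).
toy_probs = [0.9, 0.1, 0.8, 0.3]
toy_labels = [1, 0, 1, 0]
toy_brier = brier_score(toy_probs, toy_labels)
toy_ece = expected_calibration_error(toy_probs, toy_labels)
```

Lower is better for both metrics: Brier score penalizes each forecast's squared error, while ECE measures whether events predicted at a given probability occur at roughly that rate.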

Longitudinal clinical notes contain rich evidence of how patients evolve over time, but converting this signal into training supervision for clinical prediction remains challenging. We extend Foresight Learning to clinical prediction by converting time-ordered MIMIC-III notes into examples consisting of past patient context, a natural-language question about a possible future event, and a label resolved from later documentation. This process yields 6,900 prediction examples from 702 admissions across medications, procedures, organ support, microbiology, and mortality. A small LoRA adapter trained on these examples improves over the prompted base model, reducing expected calibration error from 0.1269 to 0.0398 and Brier score from 0.199 to 0.145, while slightly outperforming GPT-5 point estimates on held-out questions. The approach enables reusable clinical prediction supervision from longitudinal notes without hand-engineered structured features or endpoint-specific classifiers.
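
The example format described in the abstract (past context, a question about a future event, and a label resolved from later documentation) can be sketched roughly as follows. This is an illustrative sketch only: the `Note` and `PredictionExample` types, the `build_example` helper, and the keyword-based resolver are hypothetical and are not taken from the paper, which does not specify here how labels are resolved. The notes are synthetic.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, List


@dataclass
class Note:
    time: datetime
    text: str


@dataclass
class PredictionExample:
    context: str   # notes written at or before the cutoff
    question: str  # natural-language question about a possible future event
    label: bool    # outcome resolved from notes written after the cutoff


def build_example(
    notes: List[Note],
    cutoff: datetime,
    question: str,
    resolver: Callable[[List[Note]], bool],
) -> PredictionExample:
    """Split time-ordered notes at the cutoff so that no future
    documentation leaks into the model's context."""
    ordered = sorted(notes, key=lambda n: n.time)
    past = [n for n in ordered if n.time <= cutoff]
    future = [n for n in ordered if n.time > cutoff]
    context = "\n\n".join(n.text for n in past)
    return PredictionExample(context, question, resolver(future))


# Synthetic notes for illustration only.
notes = [
    Note(datetime(2100, 1, 2, 8), "Increasing oxygen requirement overnight."),
    Note(datetime(2100, 1, 1, 9), "Admitted with pneumonia."),
    Note(datetime(2100, 1, 3, 7), "Patient intubated for respiratory failure."),
]

# Hypothetical resolver: a simple keyword match over later notes.
def mentions_intubation(later: List[Note]) -> bool:
    return any("intubated" in n.text.lower() for n in later)

example = build_example(
    notes,
    cutoff=datetime(2100, 1, 2, 12),
    question="Will the patient require mechanical ventilation during this admission?",
    resolver=mentions_intubation,
)
```

The design point this illustrates is the strict temporal split: the context contains only documentation available at prediction time, and the label comes only from what was written afterward.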
