CliniBench: A Clinical Outcome Prediction Benchmark for Generative and Encoder-Based Language Models
This work addresses the need for standardized evaluation of language models in clinical outcome prediction, providing a benchmark for researchers and practitioners in medical AI.
The paper introduces CliniBench, a benchmark for comparing generative and encoder-based language models on predicting discharge diagnoses from admission notes in the MIMIC-IV dataset. It finds that encoder-based classifiers consistently outperform generative models, and that retrieval augmentation improves the performance of generative models.
With their growing capabilities, generative large language models (LLMs) are being increasingly investigated for complex medical tasks. However, their effectiveness in real-world clinical applications remains underexplored. To address this, we present CliniBench, the first benchmark that enables direct comparison of well-studied encoder-based classifiers and generative LLMs for discharge diagnosis prediction from admission notes in the MIMIC-IV dataset. Our extensive study compares 12 generative LLMs and 3 encoder-based classifiers and demonstrates that encoder-based classifiers consistently outperform generative models in diagnosis prediction. We assess several retrieval augmentation strategies for in-context learning from similar patients and find that they provide notable performance improvements for generative LLMs.
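The retrieval-augmented in-context learning setup mentioned above can be illustrated with a minimal sketch. It is not the benchmark's implementation: the abstract does not specify which retrievers or prompt templates were used, so the bag-of-words cosine similarity, the prompt wording, and the function names `retrieve_similar` and `build_prompt` are all assumptions chosen for illustration, and the notes are invented toy examples rather than MIMIC-IV data.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a simple stand-in for a real text encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_similar(query_note, corpus, k=2):
    """Return the k (note, diagnoses) pairs most similar to query_note."""
    q = bag_of_words(query_note)
    ranked = sorted(
        corpus,
        key=lambda example: cosine(q, bag_of_words(example[0])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query_note, examples):
    """Format retrieved patients as few-shot demonstrations before the query."""
    parts = []
    for note, diagnoses in examples:
        parts.append(
            f"Admission note: {note}\nDischarge diagnoses: {'; '.join(diagnoses)}"
        )
    # The query note comes last, with the diagnoses left for the model to fill in.
    parts.append(f"Admission note: {query_note}\nDischarge diagnoses:")
    return "\n\n".join(parts)


# Hypothetical toy corpus of previously seen patients (not real clinical data).
toy_corpus = [
    ("Patient presents with chest pain radiating to left arm.",
     ["Acute myocardial infarction"]),
    ("Patient presents with fever, cough, and shortness of breath.",
     ["Pneumonia"]),
    ("Fall at home with right hip pain and inability to bear weight.",
     ["Hip fracture"]),
]
toy_query = "Crushing chest pain radiating to the left arm with sweating."
toy_prompt = build_prompt(toy_query, retrieve_similar(toy_query, toy_corpus, k=2))
```

The resulting `toy_prompt` string would then be passed to a generative LLM. In practice the lexical similarity here would likely be replaced by a dense or sparse retriever over the training split, and the choice of `k` and of the demonstration format are tunable design decisions.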