Enhancing LLMs with Smart Preprocessing for EHR Analysis
This work addresses the privacy and computational constraints of using LLMs in healthcare applications, though it appears incremental because it builds on existing preprocessing and RAG techniques.
The paper tackles the challenge of applying Large Language Models (LLMs) to Electronic Health Records (EHRs) in privacy-sensitive, resource-constrained healthcare settings. It introduces a preprocessing framework that uses regular expressions (regex) and Retrieval-Augmented Generation (RAG) to enhance smaller LLMs, and it reports significant performance improvements on a private dataset and on the public MIMIC-IV dataset.
Large Language Models (LLMs) have demonstrated remarkable proficiency in natural language processing; however, their application in sensitive domains such as healthcare, especially in processing Electronic Health Records (EHRs), is constrained by limited computational resources and privacy concerns. This paper introduces a compact LLM framework optimized for local deployment in environments with stringent privacy requirements and restricted access to high-performance GPUs. Our approach leverages simple yet powerful preprocessing techniques, including regular expressions (regex) and Retrieval-Augmented Generation (RAG), to extract and highlight critical information from clinical notes. By pre-filtering long, unstructured text, we enhance the performance of smaller LLMs on EHR-related tasks. Our framework is evaluated under zero-shot and few-shot learning paradigms on both private and publicly available (MIMIC-IV) datasets, with additional comparisons against fine-tuned LLMs on MIMIC-IV. Experimental results demonstrate that our preprocessing strategy significantly improves the performance of smaller LLMs, making them well suited to privacy-sensitive and resource-constrained applications. This study offers valuable insights into optimizing LLM performance for local, secure, and efficient healthcare applications. It provides practical guidance for real-world deployment of LLMs while tackling challenges related to privacy, computational feasibility, and clinical applicability.
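To make the pre-filtering idea concrete, the following is a minimal sketch of how regex matching and a retrieval step could shrink a long clinical note before it reaches a small LLM. It is an illustration under stated assumptions, not the paper's implementation: the keyword pattern, the function names (`regex_filter`, `retrieve`, `build_prompt`), the prompt layout, and the sample note are all hypothetical, and a bag-of-words cosine similarity stands in for whatever retriever the authors actually use.

```python
import math
import re
from collections import Counter

# Hypothetical keyword pattern; the paper does not list its regexes.
KEYWORD_PATTERN = re.compile(r"\b(diabet\w*|insulin|metformin|hba1c)\b", re.IGNORECASE)


def split_sentences(note):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", note.strip()) if s]


def regex_filter(sentences, pattern=KEYWORD_PATTERN):
    """Keep only sentences that mention one of the target keywords."""
    return [s for s in sentences if pattern.search(s)]


def tokenize(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(sentences, query, k=2):
    """Return the k sentences most similar to the query.

    Assumption: bag-of-words similarity is a stand-in for an embedding-based
    retriever, which a real RAG pipeline would more likely use.
    """
    query_vec = Counter(tokenize(query))
    ranked = sorted(
        sentences,
        key=lambda s: cosine(Counter(tokenize(s)), query_vec),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(note, question, k=2):
    """Combine regex hits and retrieved sentences into a short prompt."""
    sentences = split_sentences(note)
    selected = set(regex_filter(sentences)) | set(retrieve(sentences, question, k))
    # Preserve the original order of the note in the condensed context.
    context = "\n".join(f"- {s}" for s in sentences if s in selected)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Made-up example note, not taken from MIMIC-IV.
SAMPLE_NOTE = (
    "Patient admitted with chest pain. "
    "History of type 2 diabetes on metformin. "
    "Family visited in the afternoon. "
    "Blood pressure was elevated at 160/95."
)

if __name__ == "__main__":
    print(build_prompt(SAMPLE_NOTE, "Does the patient have high blood pressure?", k=1))
```

The condensed prompt would then be passed to a locally hosted model. Keeping both the regex hits and the retrieved sentences is a design choice made for this sketch: regex captures known task-specific terms, while retrieval picks up sentences relevant to the question that no fixed pattern anticipates.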