IR · CL · Jan 14

Leveraging Large Language Models to Extract and Translate Medical Information in Doctors' Notes for Health Records and Diagnostic Billing Codes

Peter Hartnett, Chung-Chi Huang, Sarah Hartnett, David Hartnett

arXiv:2603.22625 · h-index: 8 · Has Code

AI Analysis

For healthcare providers and patients, this work addresses the administrative burden of EHR documentation and diagnostic coding while preserving privacy, but the results are incremental as they confirm current limitations of small local models.

This thesis explores using open-weight LLMs (7B-20B parameters) for on-device, offline automatic ICD-10-CM coding from physician notes to reduce burnout and maintain privacy. Results show near-perfect formatting compliance but poor code accuracy, with few-shot prompting degrading performance and RAG causing context saturation, concluding that fully automated coding is not yet reliable and a human-in-the-loop approach is more practical.

Physician burnout in the United States has reached critical levels, driven in part by the administrative burden of Electronic Health Record (EHR) documentation and complex diagnostic codes. To relieve this strain and maintain strict patient privacy, this thesis explores an on-device, offline automatic medical coding system. The work focuses on using open-weight Large Language Models (LLMs) to extract clinical information from physician notes and translate it into ICD-10-CM diagnostic codes without reliance on cloud-based services. A privacy-focused pipeline was developed using Ollama, LangChain, and containerized environments to evaluate multiple open-weight models, including Llama 3.2, Mistral, Phi, and DeepSeek, on consumer-grade hardware. Model performance was assessed for zero-shot, few-shot, and retrieval-augmented generation (RAG) prompting strategies using a novel benchmark of synthetic medical notes. Results show that strict JSON schema enforcement achieved near 100% formatting compliance, but accurate generation of specific diagnostic codes remains challenging for smaller local models (7B-20B parameters). Contrary to common prompt-engineering guidance, few-shot prompting degraded performance through overfitting and hallucinations. While RAG enabled limited discovery of unseen codes, it frequently saturated context windows, reducing overall accuracy. The findings suggest that fully automated unsupervised coding with local open-source models is not yet reliable; instead, a human-in-the-loop assisted coding approach is currently the most practical path forward. This work contributes a reproducible local LLM architecture and benchmark dataset for privacy-preserving medical information extraction and coding.
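The abstract separates two things that are easy to conflate: whether a model's reply follows the required JSON format, and whether the codes inside it are correct. The following is a minimal sketch of that distinction, not the thesis's actual pipeline. The reply layout (a JSON object with a `codes` list), the function names, and the regular expression (a rough shape check for ICD-10-CM codes, not a lookup against the official code set) are all assumptions made for illustration.

```python
import json
import re

# Assumed shape check: letter, digit, alphanumeric, then an optional
# decimal point followed by 1-4 alphanumerics (e.g. "E11.9", "I10").
# This only tests the shape; it does not confirm the code exists.
ICD10CM_SHAPE = re.compile(r"^[A-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?$")


def check_response(raw: str) -> tuple[bool, list[str]]:
    """Return (format_ok, well_shaped_codes) for one raw model reply.

    The reply is assumed (hypothetically) to look like {"codes": ["E11.9"]}.
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False, []
    if not isinstance(payload, dict):
        return False, []
    codes = payload.get("codes")
    if not isinstance(codes, list) or not all(isinstance(c, str) for c in codes):
        return False, []
    # Format is compliant; keep only entries that look like ICD-10-CM codes.
    return True, [c for c in codes if ICD10CM_SHAPE.match(c)]


def recall(predicted: list[str], gold: list[str]) -> float:
    """Fraction of gold codes that appear among the predicted codes."""
    gold_set = set(gold)
    if not gold_set:
        return 0.0
    return len(gold_set & set(predicted)) / len(gold_set)


if __name__ == "__main__":
    ok, codes = check_response('{"codes": ["E11.9", "diabetes"]}')
    print(ok, codes, recall(codes, ["E11.9", "I10"]))
```

A reply can pass the format check and still score poorly on recall, which matches the reported pattern of near-perfect formatting compliance alongside weak code accuracy.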
