When Reasoning Hurts: Source-Aware Evaluation of Frontier LLMs for Clinical SOAP Note Generation

arXiv:2605.2490234.5

AI Analysis

For developers and users of LLMs in clinical documentation, this work shows that stronger reasoning capability does not automatically improve fidelity-sensitive tasks like SOAP note generation, highlighting the need for task-specific evaluation.

The study evaluates reasoning-enabled LLMs for clinical SOAP note generation and finds that enabling reasoning significantly degrades GPT-5.4 performance across all three datasets, while a non-reasoning GPT-5.4 configuration achieves the highest overall quality.

Reasoning-enabled LLMs perform strongly on medical reasoning benchmarks, but it remains unclear whether these gains transfer to structured clinical documentation; we investigate this question using SOAP note generation from clinical dialogue in a source-aware benchmark spanning OMI Health, ACI-Bench, and PriMock57. We evaluate GPT-5.4, DeepSeek-V4-Flash, and Gemma-4-E4B in a controlled 2x2 design that independently toggles provider-native reasoning and same-source retrieval-augmented generation (RAG). Outputs are assessed using seven automatic metrics alongside two reference-aware LLM judges. Both evaluation approaches agree that a non-reasoning GPT-5.4 configuration achieves the highest overall quality, while DeepSeek-V4-Flash performs best among reasoning-enabled configurations. Enabling reasoning significantly degrades GPT-5.4 performance across all three datasets, whereas same-source RAG yields smaller, model-dependent improvements. Overall, the findings indicate that stronger reasoning capability should not be assumed to improve fidelity-sensitive SOAP note generation without dedicated, task-specific evaluation.

View on arXiv PDF

Similar