ClinicalGPT-R1: Pushing reasoning capability of generalist disease diagnosis with large language model
This work addresses the need for improved AI reasoning in disease diagnosis for medical applications, representing an incremental advance by adapting existing methods to a specific domain.
The paper tackles the problem of applying large language models to clinical diagnosis by introducing ClinicalGPT-R1, which outperforms GPT-4o in Chinese diagnostic tasks and performs comparably to GPT-4 in English settings on the MedBench-Hard benchmark.
Recent advances in large language models (LLMs) have shown remarkable reasoning capabilities in domains such as mathematics and coding, yet their application to clinical diagnosis remains underexplored. Here, we introduce ClinicalGPT-R1, a reasoning-enhanced generalist large language model for disease diagnosis. Trained on a dataset of 20,000 real-world clinical records, ClinicalGPT-R1 leverages diverse training strategies to enhance diagnostic reasoning. To benchmark performance, we curated MedBench-Hard, a challenging dataset spanning seven major medical specialties and representative diseases. Experimental results demonstrate that ClinicalGPT-R1 outperforms GPT-4o in Chinese diagnostic tasks and achieves comparable performance to GPT-4 in English settings. This comparative study validates the performance of ClinicalGPT-R1 in disease diagnosis tasks. Resources are available at https://github.com/medfound/medfound.