Multi-Task Training with In-Domain Language Models for Diagnostic Reasoning
This research addresses diagnostic errors in healthcare by optimizing AI systems for clinical reasoning. The contribution is incremental, as it builds on existing benchmarks and methods.
The study aims to improve clinical diagnostic reasoning by comparing in-domain versus out-of-domain language models and multi-task versus single-task training on the DR.BENCH framework. A multi-task, clinically trained model achieves a new state-of-the-art ROUGE-L score of 28.55 on the problem summarization task.
Generative artificial intelligence (AI) is a promising direction for augmenting clinical diagnostic decision support and reducing diagnostic errors, a leading contributor to medical errors. To further the development of clinical AI systems, the Diagnostic Reasoning Benchmark (DR.BENCH) was introduced as a comprehensive generative AI framework comprising six tasks that represent key components of clinical reasoning. We present a comparative analysis of in-domain versus out-of-domain language models, as well as multi-task versus single-task training, with a focus on the problem summarization task in DR.BENCH (Gao et al., 2023). We demonstrate that a multi-task, clinically trained language model outperforms its general-domain counterpart by a large margin, establishing a new state of the art with a ROUGE-L score of 28.55. This research underscores the value of domain-specific training for optimizing clinical diagnostic reasoning tasks.
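For readers unfamiliar with the metric, ROUGE-L scores a generated summary by the longest common subsequence (LCS) of tokens it shares with a reference summary. The following is a minimal sketch, not the evaluation code used in this work: it assumes lowercased whitespace tokenization and the balanced F1 variant of ROUGE-L, and the function names `lcs_length` and `rouge_l_f1` as well as the example strings are hypothetical. Reported values such as 28.55 are presumably this kind of score expressed on a 0 to 100 scale and averaged over a test set.

```python
def lcs_length(a, b):
    """Return the length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Tokens match: extend the best subsequence ending before both.
                curr.append(prev[j - 1] + 1)
            else:
                # No match: carry over the better of skipping either token.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Sentence-level ROUGE-L F1 (assumption: lowercased whitespace tokens)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical problem lists, for illustration only.
    reference = "acute kidney injury hypertension"
    candidate = "acute kidney injury diabetes"
    print(round(100 * rouge_l_f1(reference, candidate), 2))
```

In this toy example three of the four tokens on each side fall in the common subsequence, so precision and recall are both 0.75. Published evaluations typically rely on an established ROUGE implementation, whose tokenization and stemming choices can shift the score.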