MedMT-Bench: Can LLMs Memorize and Understand Long Multi-Turn Conversations in Medical Scenarios?
This work addresses the problem of evaluating LLMs for safe and reliable use in high-stakes medical applications. The contribution is incremental, since it builds on existing benchmarking efforts.
The authors address the lack of benchmarks for testing LLMs on long multi-turn conversations in medical scenarios by introducing MedMT-Bench, a challenging benchmark with 400 test cases. All 17 frontier models tested score below 60.00% overall accuracy; the best model reaches 59.75%.
Large Language Models (LLMs) have demonstrated impressive capabilities across various specialist domains and have been integrated into high-stakes areas such as medicine. However, existing medical benchmarks rarely stress-test the long-context memory, interference robustness, and safety defense required in practice. To bridge this gap, we introduce MedMT-Bench, a challenging medical multi-turn instruction-following benchmark that simulates the entire diagnosis and treatment process. We construct the benchmark via scene-by-scene data synthesis refined by manual expert editing, yielding 400 test cases that are highly consistent with real-world application scenarios. Each test case has an average of 22 rounds (maximum of 52 rounds) and covers five types of difficult instruction-following issues. For evaluation, we propose an LLM-as-judge protocol with instance-level rubrics and atomic test points, validated against expert annotations with a human-LLM agreement of 91.94%. We test 17 frontier models, all of which underperform on MedMT-Bench (overall accuracy below 60.00%), with the best model reaching 59.75%. MedMT-Bench can be an essential tool for driving future research towards safer and more reliable medical AI. The benchmark is available at https://openreview.net/attachment?id=aKyBCsPOHB&name=supplementary_material
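The abstract does not spell out how atomic test points are aggregated into an overall accuracy or how human-LLM agreement is computed. The following Python sketch shows one plausible reading, under two labeled assumptions: a test case counts as correct only if the judge passes every one of its atomic test points, and agreement is the share of test points on which the judge and the expert give the same verdict. The class `AtomicPoint` and all field and function names are hypothetical and are not taken from the benchmark's released code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AtomicPoint:
    """One atomic, checkable requirement from an instance-level rubric (hypothetical schema)."""
    description: str
    judge_pass: bool   # verdict from the LLM judge
    human_pass: bool   # verdict from an expert annotator


def case_passes(points: List[AtomicPoint]) -> bool:
    # ASSUMPTION: a test case is correct only if every atomic point passes.
    return all(p.judge_pass for p in points)


def overall_accuracy(cases: List[List[AtomicPoint]]) -> float:
    # Percentage of test cases that the judge marks as fully correct.
    return 100.0 * sum(case_passes(c) for c in cases) / len(cases)


def human_llm_agreement(cases: List[List[AtomicPoint]]) -> float:
    # ASSUMPTION: agreement is measured per atomic point, as the share of
    # points where the judge and the expert reach the same verdict.
    points = [p for case in cases for p in case]
    matches = sum(p.judge_pass == p.human_pass for p in points)
    return 100.0 * matches / len(points)


# Toy data: two test cases with two atomic points each.
cases = [
    [AtomicPoint("recalls allergy from an early round", True, True),
     AtomicPoint("keeps the requested output format", True, True)],
    [AtomicPoint("refuses the unsafe dosage request", True, True),
     AtomicPoint("ignores the distracting instruction", False, True)],
]

accuracy = overall_accuracy(cases)
agreement = human_llm_agreement(cases)
```

Under the case-level reading, the reported best score of 59.75% on 400 test cases would correspond to 239 fully correct cases; a point-level aggregation would generally give a different figure.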