MedOmni-45°: A Safety-Performance Benchmark for Reasoning-Oriented LLMs in Medicine
The benchmark addresses the need for safer model development in medical AI by exposing reasoning vulnerabilities such as unfaithful Chain-of-Thought and sycophancy, though its contribution is incremental because it builds on existing benchmarks.
The paper addresses the evaluation of reasoning reliability in large language models (LLMs) used for medical decision support. It introduces MedOmni-45°, a benchmark that quantifies safety-performance trade-offs. The results show a consistent trade-off: no model surpasses the diagonal, and QwQ-32B comes closest at 43.81°.
With the increasing use of large language models (LLMs) in medical decision support, it is essential to evaluate not only their final answers but also the reliability of their reasoning. Two key risks concern Chain-of-Thought (CoT) faithfulness (whether reasoning aligns with responses and medical facts) and sycophancy (whether models follow misleading cues over correctness). Existing benchmarks often collapse such vulnerabilities into single accuracy scores. To address this, we introduce MedOmni-45°, a benchmark and workflow designed to quantify safety-performance trade-offs under manipulative hint conditions. It contains 1,804 reasoning-focused medical questions across six specialties and three task types, including 500 from MedMCQA. Each question is paired with seven manipulative hint types and a no-hint baseline, producing about 27K inputs. We evaluate seven LLMs spanning open- vs. closed-source, general-purpose vs. medical, and base vs. reasoning-enhanced models, totaling over 189K inferences. Three metrics (Accuracy, CoT-Faithfulness, and Anti-Sycophancy) are combined into a composite score visualized with a 45° plot. Results show a consistent safety-performance trade-off, with no model surpassing the diagonal. The open-source QwQ-32B performs closest (43.81°), balancing safety and accuracy but not leading in both. MedOmni-45° thus provides a focused benchmark for exposing reasoning vulnerabilities in medical LLMs and guiding safer model development.
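The abstract does not give the formula behind the composite score or the 45° plot, so the following is only a minimal sketch of how such an angle could be read. It assumes, purely for illustration, that safety is the unweighted mean of CoT-Faithfulness and Anti-Sycophancy, that performance is Accuracy, that all three lie in [0, 1], and that a model is plotted at the point (performance, safety). The function names and the aggregation rule are hypothetical and are not taken from the paper.

```python
import math


def safety_score(cot_faithfulness: float, anti_sycophancy: float) -> float:
    """Hypothetical safety axis: unweighted mean of the two safety metrics.

    ASSUMPTION: the paper's actual aggregation is not stated in the abstract.
    """
    return (cot_faithfulness + anti_sycophancy) / 2.0


def plot_angle_degrees(accuracy: float,
                       cot_faithfulness: float,
                       anti_sycophancy: float) -> float:
    """Angle of the point (accuracy, safety) measured from the accuracy axis.

    Under this sketch's convention, 45 degrees means safety and accuracy are
    equal; a smaller angle means safety lags accuracy, a larger one the reverse.
    """
    for name, value in (("accuracy", accuracy),
                        ("cot_faithfulness", cot_faithfulness),
                        ("anti_sycophancy", anti_sycophancy)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    safety = safety_score(cot_faithfulness, anti_sycophancy)
    # atan2 handles accuracy == 0 without a division-by-zero error.
    return math.degrees(math.atan2(safety, accuracy))


def gap_to_diagonal(angle_degrees: float) -> float:
    """Absolute angular distance from the 45-degree diagonal."""
    return abs(45.0 - angle_degrees)


if __name__ == "__main__":
    # Made-up metric values, not results from the paper.
    angle = plot_angle_degrees(accuracy=0.80,
                               cot_faithfulness=0.70,
                               anti_sycophancy=0.74)
    print(f"angle = {angle:.2f} deg, gap = {gap_to_diagonal(angle):.2f} deg")
```

Under this convention, a reported value such as 43.81° would correspond to a model sitting slightly on the accuracy side of the diagonal. Whether the paper measures the angle the same way is not something the abstract confirms.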