AI CL LG
Sep 22, 2025

The Illusion of Readiness: Stress Testing Large Frontier Models on Multimodal Medical Benchmarks

Yu Gu, Jingjing Fu, Xiaodong Liu, Jeya Maria Jose Valanarasu, Noel CF Codella, Reuben Tan, Qianchu Liu, Ying Jin, Sheng Zhang, Jinyu Wang, Rui Wang, Lei Song

arXiv:2509.18234v2
18.18 citations
h-index: 23

Originality: Incremental advance

AI Analysis

This work highlights a critical issue for healthcare AI by showing that current medical benchmarks mask failure modes and do not reflect real-world readiness, urging a shift towards robustness and accountability.

The study addresses the problem that large frontier models achieve high scores on medical benchmarks without demonstrating genuine medical understanding. Stress tests of six flagship models on six benchmarks show that these models often guess correctly when key inputs are missing, flip answers under trivial changes, and fabricate flawed reasoning, which exposes brittleness and shortcut learning.

Large frontier models like GPT-5 now achieve top scores on medical benchmarks. But our stress tests tell a different story. Leading systems often guess correctly even when key inputs like images are removed, flip answers under trivial prompt changes, and fabricate convincing yet flawed reasoning. These aren't glitches; they expose how today's benchmarks reward test-taking tricks over medical understanding. We evaluate six flagship models across six widely used benchmarks and find that high leaderboard scores hide brittleness and shortcut learning. Through clinician-guided rubric evaluation, we show that benchmarks vary widely in what they truly measure yet are treated interchangeably, masking failure modes. We caution that medical benchmark scores do not directly reflect real-world readiness. If we want AI to earn trust in healthcare, we must demand more than leaderboard wins and must hold systems accountable for robustness, sound reasoning, and alignment with real medical demands.
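The two simplest stress tests named in the abstract, removing the image and applying a trivial change to the prompt, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the `Item` record, the `Model` callable, the helper names, and the toy items are all hypothetical, and reversing the option order stands in for whatever perturbations the paper actually applies.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Item:
    """Hypothetical multiple-choice benchmark item (not the paper's schema)."""
    question: str
    options: tuple[str, ...]
    answer: str                      # text of the correct option
    image: Optional[bytes] = None    # None means the image was withheld


# A model is assumed to be any callable that returns the chosen option text.
Model = Callable[[Item], str]


def accuracy(model: Model, items: Sequence[Item]) -> float:
    """Fraction of items where the model picks the correct option."""
    return sum(model(it) == it.answer for it in items) / len(items)


def without_image(item: Item) -> Item:
    """Stress test 1: drop the image, keep everything else."""
    return replace(item, image=None)


def reversed_options(item: Item) -> Item:
    """Stress test 2 (illustrative): a trivial change to option order."""
    return replace(item, options=item.options[::-1])


def flip_rate(model: Model, items: Sequence[Item],
              perturb: Callable[[Item], Item]) -> float:
    """Fraction of items whose answer changes after the perturbation."""
    return sum(model(it) != model(perturb(it)) for it in items) / len(items)


def first_option_model(item: Item) -> str:
    """Toy shortcut learner: ignores the image and picks the first option."""
    return item.options[0]


# Toy data in which the correct answer always happens to be listed first.
items = [
    Item("Which finding is shown?", ("pneumothorax", "normal"), "pneumothorax", b"img1"),
    Item("Which finding is shown?", ("effusion", "normal"), "effusion", b"img2"),
]

print(accuracy(first_option_model, items))                              # 1.0
print(accuracy(first_option_model, [without_image(i) for i in items]))  # 1.0
print(flip_rate(first_option_model, items, reversed_options))           # 1.0
```

In this toy setup the shortcut model scores perfectly with or without the image, yet changes every answer when the options are reordered. That is the kind of gap between leaderboard accuracy and robustness the abstract describes.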
