QM AI ET IR | Aug 29, 2025

OpenAI's HealthBench in Action: Evaluating an LLM-Based Medical Assistant on Realistic Clinical Queries

Sandhanakrishnan Ravichandran, Shivesh Kumar, Rogerio Corga Da Silva, Miguel Romano, Reinhard Berkels, Michiel van der Heijden, Olivier Fail, Valentine Emmanuel Gnanapragasam

arXiv:2509.02594v15.14 citations | h-index: 1

Originality: Incremental advance

AI Analysis

This work addresses the need for more realistic and comprehensive evaluation of AI clinical assistants, which is crucial for building reliable and trustworthy systems in high-stakes medical settings. Its contribution is incremental, since it builds on existing RAG and benchmark approaches.

The paper tackles the problem of evaluating large language models for clinical support by applying HealthBench, a rubric-driven benchmark that uses open-ended, expert-annotated health conversations to assess competencies such as contextual reasoning and uncertainty handling. The authors' agentic, RAG-based assistant, DR.INFO, achieved a HealthBench score of 0.51 on the Hard subset of 1,000 examples, outperforming leading frontier LLMs. In a separate 100-sample evaluation against similar agentic RAG assistants, it maintained its lead with a score of 0.54.

Evaluating large language models (LLMs) on their ability to generate high-quality, accurate, situationally aware answers to clinical questions requires going beyond conventional benchmarks to assess how these systems behave in complex, high-stakes clinical scenarios. Traditional evaluations are often limited to multiple-choice questions that fail to capture essential competencies such as contextual reasoning, awareness, and uncertainty handling. To address these limitations, we evaluate our agentic, RAG-based clinical support assistant, DR.INFO, using HealthBench, a rubric-driven benchmark composed of open-ended, expert-annotated health conversations. On the Hard subset of 1,000 challenging examples, DR.INFO achieves a HealthBench score of 0.51, substantially outperforming leading frontier LLMs (GPT-5, o3, Grok 3, GPT-4, Gemini 2.5, etc.) across all behavioral axes (accuracy, completeness, instruction following, etc.). In a separate 100-sample evaluation against similar agentic RAG assistants (OpenEvidence, Pathway.md), it maintains a performance lead with a HealthBench score of 0.54. These results highlight DR.INFO's strengths in communication, instruction following, and accuracy, while also revealing areas for improvement in context awareness and response completeness. Overall, the findings underscore the utility of behavior-level, rubric-based evaluation for building a reliable and trustworthy AI-enabled clinical support assistant.
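To make the idea of a rubric-based score concrete, the following is a minimal sketch of how a score such as 0.51 could be computed from graded rubric criteria. It is not taken from the paper. It assumes that each criterion carries a point value (negative for undesirable behavior), that a grader has already marked each criterion as met or not met, that an example's score is earned points divided by the maximum positive points, and that the benchmark score is the mean over examples clipped to [0, 1]. The names `Criterion`, `example_score`, and `benchmark_score` are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class Criterion:
    """One rubric item (hypothetical structure, for illustration only)."""
    description: str
    points: int   # positive = desired behavior, negative = penalized behavior
    met: bool     # assumed to come from a separate grading step


def example_score(criteria: List[Criterion]) -> Optional[float]:
    """Earned points divided by the maximum achievable positive points."""
    possible = sum(c.points for c in criteria if c.points > 0)
    if possible == 0:
        return None  # no positive criteria, so the example cannot be scored
    earned = sum(c.points for c in criteria if c.met)
    return earned / possible


def benchmark_score(examples: List[List[Criterion]]) -> float:
    """Mean of per-example scores, clipped to the range [0, 1]."""
    scores = [s for s in map(example_score, examples) if s is not None]
    if not scores:
        raise ValueError("no scorable examples")
    return min(1.0, max(0.0, mean(scores)))


# Illustrative, made-up rubric outcomes (not data from the paper).
examples = [
    [
        Criterion("Recommends urgent evaluation for red-flag symptoms", 5, True),
        Criterion("Asks for missing context before advising", 5, False),
        Criterion("States an unsupported dosage", -2, True),
    ],
    [
        Criterion("Explains uncertainty in plain language", 4, True),
    ],
]

overall = benchmark_score(examples)
print(f"benchmark score: {overall:.2f}")
```

Under these assumptions, a penalized behavior lowers an example's score even when desired behaviors are present, which is one way a rubric can capture more than a multiple-choice accuracy figure.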
