CL · AI · Jun 16, 2025

HealthQA-BR: A System-Wide Benchmark Reveals Critical Knowledge Gaps in Large Language Models

arXiv:2506.21578v1 · 1 citation

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for more holistic and realistic assessments of AI readiness in healthcare, moving beyond physician-centric benchmarks to cover interprofessional knowledge gaps. It is incremental, however, in that it extends existing evaluation methods to a new language and a broader professional scope.

The authors introduce HealthQA-BR, a Portuguese-language benchmark covering multiple health professions, to evaluate Large Language Models (LLMs) in healthcare. Top models such as GPT 4.1 achieved 86.6% overall accuracy, but performance varied drastically by area, dropping to 60.0% in Neurosurgery and 68.4% in Social Work.

The evaluation of Large Language Models (LLMs) in healthcare has been dominated by physician-centric, English-language benchmarks, creating a dangerous illusion of competence that ignores the interprofessional nature of patient care. To provide a more holistic and realistic assessment, we introduce HealthQA-BR, the first large-scale, system-wide benchmark for Portuguese-speaking healthcare. Comprising 5,632 questions from Brazil's national licensing and residency exams, it uniquely assesses knowledge not only in medicine and its specialties but also in nursing, dentistry, psychology, social work, and other allied health professions. We conducted a rigorous zero-shot evaluation of over 20 leading LLMs. Our results reveal that while state-of-the-art models like GPT 4.1 achieve high overall accuracy (86.6%), this top-line score masks alarming, previously unmeasured deficiencies. A granular analysis shows performance plummets from near-perfect in specialties like Ophthalmology (98.7%) to barely passing in Neurosurgery (60.0%) and, most notably, Social Work (68.4%). This "spiky" knowledge profile is a systemic issue observed across all models, demonstrating that high-level scores are insufficient for safety validation. By publicly releasing HealthQA-BR and our evaluation suite, we provide a crucial tool to move beyond single-score evaluations and toward a more honest, granular audit of AI readiness for the entire healthcare team.
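The gap between a single top-line score and a per-specialty breakdown can be illustrated with a minimal Python sketch. It is not the authors' evaluation suite: the function name `accuracy_by_group`, the `(group, is_correct)` record format, and the toy records are all assumptions made for illustration, and the records are invented rather than taken from HealthQA-BR.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return (overall_accuracy, {group: accuracy}).

    `records` is an iterable of (group, is_correct) pairs. This record
    format is a hypothetical one chosen for this sketch.
    """
    totals = defaultdict(int)  # questions seen per group
    hits = defaultdict(int)    # correct answers per group
    for group, is_correct in records:
        totals[group] += 1
        hits[group] += int(bool(is_correct))
    if not totals:
        raise ValueError("no records to score")
    per_group = {g: hits[g] / totals[g] for g in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return overall, per_group


# Toy, made-up records (NOT HealthQA-BR data): group "A" is large and
# mostly correct, group "B" is small and only half correct.
records = (
    [("A", True)] * 9 + [("A", False)] * 1
    + [("B", True)] * 2 + [("B", False)] * 2
)

overall, per_group = accuracy_by_group(records)
weakest = min(per_group, key=per_group.get)  # group with the lowest accuracy
print(f"overall={overall:.3f}", per_group, "weakest:", weakest)
```

Because the overall figure is weighted by question count, a large, well-answered group can hide a small, poorly answered one. Reporting the per-group dictionary alongside the overall score makes the weakest area visible.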
