AI LG Jan 26

Health-SCORE: Towards Scalable Rubrics for Improving Health-LLMs

Zhichao Yang, Sepehr Janghorbani, Dongxu Zhang, Jun Han, Qian Qian, Andrew Ressler, Gregory D. Lyng, Sanjit Singh Batra, Robert E. Tillman

arXiv:2601.18706v17.56 citations h-index: 13

Originality: Incremental advance

AI Analysis

The paper addresses scalability issues in rubric-based evaluation and training for healthcare LLMs, though it appears incremental because it builds on existing rubric methods.

The paper tackles the problem of high development costs for domain-specific rubrics in healthcare LLM evaluation by introducing Health-SCORE, a scalable framework that reduces effort while maintaining performance comparable to human-created rubrics.

Rubrics are essential for evaluating open-ended LLM responses, especially in safety-critical domains such as healthcare. However, creating high-quality, domain-specific rubrics typically requires significant human expertise, time, and development cost, making rubric-based evaluation and training difficult to scale. In this work, we introduce Health-SCORE, a generalizable and scalable rubric-based training and evaluation framework that substantially reduces rubric development costs without sacrificing performance. We show that Health-SCORE provides two practical benefits beyond standalone evaluation: it can be used as a structured reward signal to guide reinforcement learning with safety-aware supervision, and it can be incorporated directly into prompts to improve response quality through in-context learning. Across open-ended healthcare tasks, Health-SCORE achieves evaluation quality comparable to human-created rubrics while significantly lowering development effort, making rubric-based evaluation and training more scalable.
