AIJun 11, 2025

The NordDRG AI Benchmark for Large Language Models

arXiv:2506.13790v3h-index: 3
Originality Synthesis-oriented
AI Analysis

This provides a reproducible benchmark for evaluating LLMs in hospital funding systems, addressing transparency and auditability concerns, though it is incremental as it focuses on a specific domain.

The paper tackles the lack of open benchmarks for large language models (LLMs) in clinical coding for hospital funding by releasing NordDRG-AI-Benchmark, a rule-complete test bed for Diagnosis-Related Group (DRG) reasoning, where GPT-5 Thinking and Opus 4.1 scored 13/13 on logic tasks, but only GPT-5 Thinking solved 7/13 on full grouper emulation.

Large language models (LLMs) are being piloted for clinical coding and decision support, yet no open benchmark targets the hospital-funding layer where Diagnosis-Related Groups (DRGs) determine reimbursement. In most OECD systems, DRGs route a substantial share of multi-trillion-dollar health spending through governed grouper software, making transparency and auditability first-order concerns. We release NordDRG-AI-Benchmark, the first public, rule-complete test bed for DRG reasoning. The package includes (i) machine-readable approximately 20-sheet NordDRG definition tables and (ii) expert manuals and change-log templates that capture governance workflows. It exposes two suites: a 13-task Logic benchmark (code lookup, cross-table inference, grouping features, multilingual terminology, and CC/MCC validity checks) and a 13-task Grouper benchmark that requires full DRG grouper emulation with strict exact-match scoring on both the DRG and the triggering drg_logic.id. Lightweight reference agents (LogicAgent, GrouperAgent) enable artefact-only evaluation. Under an artefact-only (no web) setting, on the 13 Logic tasks GPT-5 Thinking and Opus 4.1 score 13/13, o3 scores 12/13; mid-tier models (GPT-5 Thinking Mini, o4-mini, GPT-5 Fast) achieve 6-8/13, and remaining models score 5/13 or below. On full grouper emulation across 13 tasks, GPT-5 Thinking solves 7/13, o3 6/13, o4-mini 3/13; GPT-5 Thinking Mini solves 1/13, and all other tested endpoints score 0/13. To our knowledge, this is the first public report of an LLM partially emulating the complete NordDRG grouper logic with governance-grade traceability. Coupling a rule-complete release with exact-match tasks and open scoring provides a reproducible yardstick for head-to-head and longitudinal evaluation in hospital funding. Benchmark materials available in Github.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes