RULERS: Locked Rubrics and Evidence-Anchored Scoring for Robust LLM Evaluation

Yihan Hong, Huaiyuan Yao, Bolin Shen, Wanpeng Xu, Hua Wei, Yushun Dong

arXiv:2601.08654v1 | 3.59 citations | h-index: 3 | Has Code

Originality: Highly original

AI Analysis

This paper addresses the problem of robust, reliable LLM evaluation for researchers and practitioners. It offers a novel framework for improving judge alignment and stability, though its contribution is an incremental refinement of the LLM-as-a-Judge paradigm.

The paper tackles the challenge of aligning frozen black-box LLMs with human standards for scalable rubric-based evaluation. It introduces RULERS, a compiler-executor framework that transforms natural language rubrics into executable specifications. In the reported experiments, RULERS significantly outperforms baselines in human agreement and enables smaller models to rival larger proprietary judges.
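To make the idea of compiling a rubric into a locked specification concrete, here is a minimal sketch. It is not the RULERS implementation: the `RubricBundle` class, the `compile_rubric` function, and the (name, description, max score) layout are all hypothetical names chosen for illustration. The sketch only shows one plausible way to freeze criteria and derive a version identifier from their content, so that any edit to the rubric text produces a different version.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricBundle:
    """Hypothetical locked rubric: frozen criteria plus a content-derived version."""
    criteria: tuple  # tuple of (name, description, max_score) tuples
    version: str     # short hash of the canonical criteria text


def compile_rubric(criteria):
    """Freeze a list of (name, description, max_score) items into a bundle.

    Assumption: the version is the first 12 hex characters of a SHA-256
    hash over a canonical JSON encoding of the criteria.
    """
    frozen = tuple((str(n), str(d), int(m)) for n, d, m in criteria)
    canonical = json.dumps(frozen, separators=(",", ":"), ensure_ascii=False)
    version = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
    return RubricBundle(criteria=frozen, version=version)


if __name__ == "__main__":
    bundle = compile_rubric([("clarity", "Ideas are easy to follow.", 5)])
    print(bundle.version, bundle.criteria)
```

Because the dataclass is frozen, attempts to reassign `criteria` or `version` after compilation raise an error, and the hash ties every score to the exact rubric wording that produced it.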

The LLM-as-a-Judge paradigm promises scalable rubric-based evaluation, yet aligning frozen black-box models with human standards remains a challenge due to inherent generation stochasticity. We reframe judge alignment as a criteria transfer problem and isolate three recurrent failure modes: rubric instability caused by prompt sensitivity, unverifiable reasoning that lacks auditable evidence, and scale misalignment with human grading boundaries. To address these issues, we introduce RULERS (Rubric Unification, Locking, and Evidence-anchored Robust Scoring), a compiler-executor framework that transforms natural language rubrics into executable specifications. RULERS operates by compiling criteria into versioned immutable bundles, enforcing structured decoding with deterministic evidence verification, and applying lightweight Wasserstein-based post-hoc calibration, all without updating model parameters. Extensive experiments on essay and summarization benchmarks demonstrate that RULERS significantly outperforms representative baselines in human agreement, maintains strong stability against adversarial rubric perturbations, and enables smaller models to rival larger proprietary judges. Overall, our results suggest that reliable LLM judging requires executable rubrics, verifiable evidence, and calibrated scales rather than prompt phrasing alone. Code is available at https://github.com/LabRAI/Rulers.git.
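Two of the steps named in the abstract, deterministic evidence verification and Wasserstein-based post-hoc calibration, can be illustrated with a short sketch. This is a guess at the general shape, not the authors' method: the function names `verify_evidence` and `fit_quantile_map` are hypothetical, evidence checking is assumed to mean verbatim substring matching after whitespace normalization, and calibration is assumed to be empirical quantile mapping, which is the optimal transport map between two one-dimensional distributions.

```python
from bisect import bisect_right
from math import ceil


def verify_evidence(source_text, quotes):
    """Check that each quoted span appears verbatim in the source.

    Assumption: whitespace is normalized before matching; no model call
    is involved, so the check is deterministic.
    """
    normalized = " ".join(source_text.split())
    return {q: " ".join(q.split()) in normalized for q in quotes}


def fit_quantile_map(judge_scores, human_scores):
    """Return a function mapping raw judge scores onto the human scale.

    Assumption: a raw score is sent to the human score at the same
    empirical quantile, which aligns the two score distributions.
    """
    js = sorted(judge_scores)
    hs = sorted(human_scores)

    def calibrate(x):
        rank = bisect_right(js, x) / len(js)      # empirical CDF of judge scores at x
        idx = max(0, ceil(rank * len(hs)) - 1)    # matching position in human scores
        return hs[min(idx, len(hs) - 1)]

    return calibrate


if __name__ == "__main__":
    essay = "The essay argues that  cities need more parks."
    print(verify_evidence(essay, ["cities need more parks", "cities need fewer parks"]))

    calibrate = fit_quantile_map([1, 2, 3, 4], [2, 3, 4, 5])
    print([calibrate(s) for s in [1, 3, 4]])
```

Neither step touches model parameters: the first rejects scores whose cited evidence cannot be found in the input, and the second only remaps the judge's output scale using a small set of human-graded examples.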
