LG, AI | Oct 20, 2025

Auto-Rubric: Learning to Extract Generalizable Criteria for Reward Modeling

Lipeng Xie, Sen Huang, Zhuo Zhang, Anni Zou, Yunpeng Zhai, Dingchao Ren, Kezun Zhang, Haoyuan Hu, Boyin Liu, Haoran Chen, Zhaoyang Liu, Bolin Ding

arXiv:2510.17314v1

Originality: Highly original

AI Analysis

This work addresses the scalability and interpretability challenges in reward modeling for AI alignment, offering a data-efficient solution that could reduce reliance on expensive datasets.

The paper tackles the cost and opacity of reward modeling for aligning Large Language Models. It proposes a training-free framework that extracts generalizable rubrics from limited preference data. Using only 70 preference pairs (1.5% of the source data), the method reports strong performance and enables smaller models to outperform specialized ones.

Reward models are essential for aligning Large Language Models (LLMs) with human values, yet their development is hampered by costly preference datasets and poor interpretability. While recent rubric-based approaches offer transparency, they often lack systematic quality control and optimization, creating a trade-off between scalability and reliability. We address these limitations with a novel, training-free framework built on a key assumption: evaluation rubrics underlying human preferences exhibit significant generalization ability across diverse queries, a property that enables remarkable data efficiency. Our two-stage approach first infers high-quality, query-specific rubrics using a validation-guided Propose-Evaluate-Revise pipeline. Second, it generalizes these granular rubrics into a compact, non-redundant core set by maximizing an information-theoretic coding rate. The final output is an interpretable, hierarchical "Theme-Tips" rubric set. Extensive experiments demonstrate the framework's exceptional data efficiency and performance. Critically, using just 70 preference pairs (1.5% of the source data), our method also empowers smaller models like Qwen3-8B to outperform specialized, fully-trained counterparts. This work pioneers a scalable, interpretable, and data-efficient path for reward modeling.
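The abstract does not spell out the coding-rate objective, so the following is only a minimal sketch of how such a selection step might look. It assumes the standard coding rate from the lossy-coding literature, R(Z, eps) = 1/2 log det(I + d / (n eps^2) Z^T Z), applied to rubric embedding vectors, and it assumes a simple greedy search. The names `coding_rate` and `greedy_select`, the `eps` value, and the greedy strategy are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def coding_rate(Z, eps=0.5):
    """Coding rate of n row vectors of dimension d (assumed standard form).

    R(Z, eps) = 1/2 * logdet(I + d / (n * eps^2) * Z^T Z)
    Larger values mean the rows span more distinct directions.
    """
    n, d = Z.shape
    gram = Z.T @ Z  # d x d second-moment matrix of the rows
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps ** 2)) * gram)
    return 0.5 * logdet


def greedy_select(E, k, eps=0.5):
    """Greedily pick up to k rows of E whose subset has the highest coding rate.

    E holds one (hypothetical) embedding per candidate rubric. Greedy search
    is an assumption made for this sketch.
    """
    selected = []
    remaining = list(range(len(E)))
    for _ in range(min(k, len(E))):
        # Add the candidate that yields the largest coding rate for the subset.
        best = max(remaining, key=lambda i: coding_rate(E[selected + [i]], eps))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy embeddings: rubrics 0 and 1 are duplicates, rubric 2 is orthogonal.
E = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
core = greedy_select(E, k=2)
print(sorted(core))
```

On this toy input the selection keeps one of the two duplicate vectors together with the orthogonal one, which is the non-redundancy behavior the abstract describes for the core rubric set.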
