CL, Feb 3

Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation

Changze Lv, Jie Zhou, Wentao Zhao, Jingwen Xu, Zisu Huang, Muzhao Tian, Shihan Dou, Tao Gui, Le Tian, Xiao Zhou, Xiaoqing Zheng, Xuanjing Huang

arXiv:2602.03619v1, 4.79 citations, h-index: 40, Has Code

Originality Highly original

AI Analysis

This work addresses the problem of scalable and accurate evaluation for DeepResearch report generation. The contribution is incremental in that it builds on existing rubric-based methods, extending them with novel training techniques.

The paper tackles the challenge of evaluating DeepResearch-generated reports by proposing a pipeline to train query-specific rubric generators from human preferences, resulting in systems that outperform open-source baselines and match leading closed-source models on the DeepResearch Bench.

Nowadays, training and evaluating DeepResearch-generated reports remain challenging due to the lack of verifiable reward signals. Accordingly, rubric-based evaluation has become a common practice. However, existing approaches either rely on coarse, pre-defined rubrics that lack sufficient granularity, or depend on manually constructed query-specific rubrics that are costly and difficult to scale. In this paper, we propose a pipeline to train human-preference-aligned query-specific rubric generators tailored for DeepResearch report generation. We first construct a dataset of DeepResearch-style queries annotated with human preferences over paired reports, and train rubric generators via reinforcement learning with a hybrid reward combining human preference supervision and LLM-based rubric evaluation. To better handle long-horizon reasoning, we further introduce a Multi-agent Markov-state (MaMs) workflow for report generation. We empirically show that our proposed rubric generators deliver more discriminative and better human-aligned supervision than existing rubric design strategies. Moreover, when integrated into the MaMs training framework, DeepResearch systems equipped with our rubric generators consistently outperform all open-source baselines on the DeepResearch Bench and achieve performance comparable to that of leading closed-source models.
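The hybrid reward described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the two signals are combined, so everything here is an assumption made for illustration: the weighted sum, the `alpha` parameter, the binary preference-agreement term, and the names `PreferencePair`, `rubric_score`, `preference_reward`, and `hybrid_reward` are all hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class PreferencePair:
    """One annotated example: a query and two reports with a human preference."""
    query: str
    chosen: str    # report the human annotators preferred
    rejected: str  # report the human annotators did not prefer


# Hypothetical scorer: rates how well a report satisfies one rubric item, in [0, 1].
ItemScorer = Callable[[str, str], float]
# Hypothetical LLM judge: rates the quality of a rubric for a query, in [0, 1].
RubricJudge = Callable[[str, Sequence[str]], float]


def rubric_score(rubric: Sequence[str], report: str, score_item: ItemScorer) -> float:
    """Average per-item score of a report under a generated rubric."""
    if not rubric:
        return 0.0
    return sum(score_item(item, report) for item in rubric) / len(rubric)


def preference_reward(rubric: Sequence[str], pair: PreferencePair,
                      score_item: ItemScorer) -> float:
    """1.0 if the rubric ranks the human-preferred report higher, else 0.0."""
    chosen = rubric_score(rubric, pair.chosen, score_item)
    rejected = rubric_score(rubric, pair.rejected, score_item)
    return 1.0 if chosen > rejected else 0.0


def hybrid_reward(rubric: Sequence[str], pair: PreferencePair,
                  score_item: ItemScorer, judge: RubricJudge,
                  alpha: float = 0.5) -> float:
    """Assumed combination: weighted sum of preference agreement and judge score."""
    pref = preference_reward(rubric, pair, score_item)
    quality = judge(pair.query, rubric)
    return alpha * pref + (1.0 - alpha) * quality
```

In a reinforcement-learning loop of the kind the abstract mentions, a scalar like this would be computed for each rubric sampled from the generator and used as its reward; the actual reward design in the paper may differ.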

