CL | AI | Jul 12, 2025

CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards

arXiv:2507.09104v1 | 5 citations | h-index: 21
Originality: Incremental advance
AI Analysis

This work addresses the need for more robust and scalable evaluation of large language models, a concern for AI researchers and developers. The advance appears incremental, however, because it builds on existing judge-model concepts.

The paper tackles the problem of narrow specialization and limited robustness in LLM-as-judge models by introducing CompassJudger-2, a generalist judge model that achieves superior results across multiple benchmarks, with a 7B model showing competitive accuracy against significantly larger models like DeepSeek-V3 and Qwen3-235B-A22B.

Recently, the role of LLM-as-judge in evaluating large language models has gained prominence. However, current judge models suffer from narrow specialization and limited robustness, undermining their capacity for comprehensive evaluations. In this work, we present CompassJudger-2, a novel generalist judge model that overcomes these limitations via a task-driven, multi-domain data curation strategy. Central to our approach is supervising judgment tasks with verifiable rewards, guiding intrinsic critical reasoning through rejection sampling to foster robust, generalizable judgment capabilities. We introduce a refined learning objective with margin policy gradient loss to enhance performance. Empirically, CompassJudger-2 achieves superior results across multiple judge and reward benchmarks, and our 7B model demonstrates competitive judgment accuracy with significantly larger models like DeepSeek-V3 and Qwen3-235B-A22B. Additionally, we propose JudgerBenchV2, a comprehensive benchmark evaluating cross-domain judgment accuracy and rank consistency to standardize judge model evaluation. These contributions advance robust, scalable LLM judgment and establish new performance and evaluation standards.
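The abstract's idea of supervising judgments with verifiable rewards and filtering them by rejection sampling can be illustrated with a minimal sketch. This is not the authors' implementation: the `Verdict: A/B` output format, the function names, and the toy sampler are all assumptions made for illustration. The sketch only shows the general pattern of sampling several critiques per prompt and keeping those whose final verdict matches a known gold label.

```python
import itertools
from typing import Callable, List, Optional


def extract_verdict(judgment: str) -> Optional[str]:
    """Parse the final 'Verdict: A' / 'Verdict: B' line (hypothetical format)."""
    for line in reversed(judgment.strip().splitlines()):
        line = line.strip()
        if line.lower().startswith("verdict:"):
            choice = line.split(":", 1)[1].strip().upper()
            return choice if choice in {"A", "B"} else None
    return None  # no verdict line found


def verifiable_reward(judgment: str, gold: str) -> int:
    """Reward is 1 only when the parsed verdict equals the gold label."""
    return int(extract_verdict(judgment) == gold)


def rejection_sample(
    prompt: str,
    gold: str,
    sample_fn: Callable[[str], str],
    num_samples: int = 8,
) -> List[str]:
    """Draw num_samples judgments and keep only those with reward 1."""
    candidates = [sample_fn(prompt) for _ in range(num_samples)]
    return [c for c in candidates if verifiable_reward(c, gold) == 1]


def make_toy_sampler(outputs: List[str]) -> Callable[[str], str]:
    """Stand-in for a judge model: cycles through canned outputs."""
    it = itertools.cycle(outputs)
    return lambda prompt: next(it)


if __name__ == "__main__":
    sampler = make_toy_sampler([
        "Response A cites the source correctly.\nVerdict: A",
        "Response B is longer.\nVerdict: B",
        "Both responses are unclear.",
    ])
    kept = rejection_sample("Which response is better?", "A", sampler, num_samples=6)
    print(len(kept), "of 6 samples kept")
```

In a real pipeline, `sample_fn` would call a judge model at non-zero temperature, and the retained judgments would become training data. How the paper combines these samples with its margin policy gradient loss is not specified in the abstract and is not modeled here.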

Code Implementations: 1 repo
