CL | AI | Oct 21, 2024

CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution

Peking U
arXiv:2410.16256v1 | 48 citations | h-index: 33 | Has Code
Originality: Incremental advance
AI Analysis

The work gives LLM researchers and developers a comprehensive and flexible automated evaluation solution, though it appears incremental because it builds on existing judge-model concepts.

The paper tackles the problem of costly and irreproducible human-based evaluation of large language models by introducing CompassJudger-1, an open-source all-in-one judge model that performs various evaluation tasks, and establishes JudgerBench, a new benchmark for subjective evaluation tasks.

Efficient and accurate evaluation is crucial for the continuous improvement of large language models (LLMs). Among various assessment methods, subjective evaluation has garnered significant attention due to its superior alignment with real-world usage scenarios and human preferences. However, human-based evaluations are costly and lack reproducibility, making precise automated evaluators (judgers) vital in this process. In this report, we introduce CompassJudger-1, the first open-source all-in-one judge LLM. CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility. It is capable of: 1. Performing unitary scoring and two-model comparisons as a reward model; 2. Conducting evaluations according to specified formats; 3. Generating critiques; 4. Executing diverse tasks like a general LLM. To assess the evaluation capabilities of different judge models under a unified setting, we have also established JudgerBench, a new benchmark that encompasses various subjective evaluation tasks and covers a wide range of topics. CompassJudger-1 offers a comprehensive solution for various evaluation tasks while maintaining the flexibility to adapt to diverse requirements. Both CompassJudger and JudgerBench are released and available to the research community at https://github.com/open-compass/CompassJudger. We believe that by open-sourcing these tools, we can foster collaboration and accelerate progress in LLM evaluation methodologies.

Code Implementations: 1 repo
