AI · LG · Oct 8, 2025

Auto-Prompt Ensemble for LLM Judge

arXiv:2510.06538v1 · 4 citations · h-index: 5
Originality: Incremental advance
AI Analysis

This work targets the evaluation gap between human and LLM judges, a concern for AI researchers and practitioners, and makes an incremental improvement through an adaptive ensemble method.

The paper tackles the unreliability of LLM judges by proposing the Auto-Prompt Ensemble (APE) framework, which automatically learns evaluation dimensions from the judge's failure cases. In the zero-shot setting, APE raises GPT-4o's agreement rate on Reward Bench from 87.2% to 90.5%.

We present a novel framework that improves the reliability of LLM judges by selectively augmenting the LLM with auxiliary evaluation dimensions. Existing LLM judges often miss crucial evaluation dimensions because they fail to recognize the implicit standards underlying human assessments. To address this challenge, we propose the Auto-Prompt Ensemble (APE), an adaptive framework that automatically learns evaluation dimensions from its failure cases. APE incorporates a confidence-based ensemble mechanism that decides when to adopt the judgments from additional evaluation dimensions, using a novel confidence estimation approach called Collective Confidence. Extensive experiments demonstrate that APE improves the reliability of LLM Judge across diverse standard benchmarks. For instance, APE raises GPT-4o's agreement rate on Reward Bench from 87.2% to 90.5% in the zero-shot setting. Overall, APE provides a principled approach for LLM Judge to leverage test-time computation and bridge the evaluation gap between human and LLM judges.
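The confidence-gated ensemble described in the abstract can be illustrated with a minimal sketch. The abstract does not define how Collective Confidence is computed, so this example assumes a simple stand-in: the fraction of auxiliary-dimension judges that agree on the majority verdict. The names `collective_confidence` and `ensemble_judge`, the `Judge` signature, and the `threshold` value of 0.8 are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple

# Hypothetical judge signature: (prompt, response_a, response_b) -> "A" or "B".
Judge = Callable[[str, str, str], str]


def collective_confidence(votes: Sequence[str]) -> Tuple[str, float]:
    """Return the majority verdict and the fraction of votes backing it.

    Assumption: agreement among votes is used as a proxy for confidence;
    the paper's actual Collective Confidence estimator may differ.
    """
    if not votes:
        raise ValueError("need at least one vote")
    verdict, count = Counter(votes).most_common(1)[0]
    return verdict, count / len(votes)


def ensemble_judge(
    base: Judge,
    dimension_judges: Sequence[Judge],
    prompt: str,
    response_a: str,
    response_b: str,
    threshold: float = 0.8,  # hypothetical cut-off, not from the paper
) -> str:
    """Keep the base verdict unless the auxiliary judges agree strongly."""
    base_verdict = base(prompt, response_a, response_b)
    if not dimension_judges:
        return base_verdict
    votes = [judge(prompt, response_a, response_b) for judge in dimension_judges]
    verdict, confidence = collective_confidence(votes)
    # Adopt the auxiliary verdict only when confidence clears the threshold.
    return verdict if confidence >= threshold else base_verdict
```

In this sketch each auxiliary judge would wrap one learned evaluation dimension (for example, an LLM call with a dimension-specific prompt). The gate leaves the base judge's verdict untouched unless the auxiliary judges agree strongly, which mirrors the "selective" augmentation the abstract describes.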
