No-Human in the Loop: Agentic Evaluation at Scale for Recommendation
ScalingEval provides a reproducible benchmark and evaluation protocol for LLMs as judges, offering actionable guidance on scaling and reliability for researchers and practitioners in AI evaluation.
The paper addresses the problem of evaluating large language models (LLMs) as judges for scalable and trustworthy assessment. It presents ScalingEval, a large-scale benchmarking study that compares 36 LLMs across product categories using a consensus-driven protocol. The results show Gemini 1.5 Pro as the best overall performer, with strong consensus in structured domains but persistent disagreement in lifestyle categories.
Evaluating large language models (LLMs) as judges is increasingly critical for building scalable and trustworthy evaluation pipelines. We present ScalingEval, a large-scale benchmarking study that systematically compares 36 LLMs, including GPT, Gemini, Claude, and Llama, across multiple product categories using a consensus-driven evaluation protocol. Our multi-agent framework aggregates pattern audits and issue codes into ground-truth labels via scalable majority voting, enabling reproducible comparison of LLM evaluators without human annotation. Applied to large-scale complementary-item recommendation, the benchmark reports four key findings: (i) Anthropic Claude 3.5 Sonnet achieves the highest decision confidence; (ii) Gemini 1.5 Pro offers the best overall performance across categories; (iii) GPT-4o provides the most favorable latency-accuracy-cost tradeoff; and (iv) GPT-OSS 20B leads among open-source models. Category-level analysis shows strong consensus in structured domains (Electronics, Sports) but persistent disagreement in lifestyle categories (Clothing, Food). These results establish ScalingEval as a reproducible benchmark and evaluation protocol for LLMs as judges, with actionable guidance on scaling, reliability, and model family tradeoffs.
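The consensus step described in the abstract can be illustrated with a minimal sketch. This is not the ScalingEval implementation: the abstract does not specify the data format, the label set, or the tie-breaking rule, so the judge names, the `pass`/`fail` labels, the alphabetical tie-break, and the function names below are all hypothetical. The sketch only shows the general idea of deriving a label per item by majority vote across LLM judges and then scoring each judge against that consensus.

```python
from collections import Counter, defaultdict


def majority_label(votes):
    """Return (label, agreement) for one item.

    Assumption: ties are broken alphabetically; the paper's actual
    tie-breaking rule is not stated in the abstract.
    """
    counts = Counter(votes)
    top = max(counts.values())
    winners = sorted(label for label, n in counts.items() if n == top)
    return winners[0], top / len(votes)


def consensus_labels(judgments):
    """Map each item to its majority label and agreement rate.

    judgments: {item_id: {judge_name: label}} (hypothetical layout).
    """
    return {
        item: majority_label(list(by_judge.values()))
        for item, by_judge in judgments.items()
    }


def judge_agreement(judgments, consensus):
    """Fraction of items on which each judge matches the consensus label."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for item, by_judge in judgments.items():
        gold, _ = consensus[item]
        for judge, label in by_judge.items():
            totals[judge] += 1
            hits[judge] += int(label == gold)
    return {judge: hits[judge] / totals[judge] for judge in totals}


# Hypothetical toy data: three judges rating two recommendation pairs.
judgments = {
    "pair_1": {"judge_a": "pass", "judge_b": "pass", "judge_c": "fail"},
    "pair_2": {"judge_a": "fail", "judge_b": "fail", "judge_c": "fail"},
}

consensus = consensus_labels(judgments)
scores = judge_agreement(judgments, consensus)
```

In this toy data, `pair_1` has a two-to-one split and `pair_2` is unanimous, so the per-item agreement rate gives a rough signal of where judges diverge, in the spirit of the category-level consensus analysis the abstract reports.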