CL · AI · Apr 4

Adaptive Rigor in AI System Evaluation using Temperature-Controlled Verdict Aggregation via Generalized Power Mean

arXiv:2604.08595 · 40.3

AI Analysis

Provides a lightweight, adjustable evaluation metric for LLM-based systems, but improvements over existing methods are marginal.

TCVA introduces a temperature-controlled aggregation method for LLM evaluation that adapts strictness to application domains, achieving human correlation comparable to RAGAS (Spearman 0.667 vs 0.676) on faithfulness without extra LLM calls.

Existing evaluation methods for LLM-based AI systems, such as LLM-as-a-Judge, verdict systems, and NLI, do not always align well with human assessment because they cannot adapt their strictness to the application domain. This paper presents Temperature-Controlled Verdict Aggregation (TCVA), a method that combines a five-level verdict-scoring system with generalized power-mean aggregation and an intuitive temperature parameter T ∈ [0.1, 1.0] that controls evaluation rigor. Low temperatures yield pessimistic scores suited to safety-critical domains; high temperatures yield lenient scores appropriate for conversational AI. Experimental evaluation on three benchmark datasets with human Likert-scale annotations (SummEval and USR) shows that TCVA achieves correlation with human judgments comparable to RAGAS on faithfulness (Spearman ρ = 0.667 vs. 0.676) while consistently outperforming DeepEval. Adjusting the temperature parameter requires no additional LLM calls.
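To make the aggregation idea concrete, here is a minimal Python sketch of a temperature-controlled generalized power mean over verdict scores. The abstract does not give the numeric values of the five verdict levels or the mapping from T to the power-mean exponent p, so the level names, the scores in `VERDICT_SCORES`, the function names, and the mapping p = 2 - 1/T are all illustrative assumptions, not the paper's definitions.

```python
import math

# Hypothetical numeric scores for a five-level verdict scale.
# The level names and values are assumptions for illustration only.
VERDICT_SCORES = {
    "fully": 1.0,
    "mostly": 0.75,
    "partial": 0.5,
    "minor": 0.25,
    "none": 0.0,
}


def temperature_to_exponent(t):
    """Map T in [0.1, 1.0] to a power-mean exponent p.

    Assumed mapping (not from the paper): p = 2 - 1/T, so T = 1.0 gives the
    arithmetic mean (p = 1), T = 0.5 the geometric mean (p = 0), and
    T = 0.1 a strongly pessimistic mean (p = -8).
    """
    if not 0.1 <= t <= 1.0:
        raise ValueError("T must lie in [0.1, 1.0]")
    return 2.0 - 1.0 / t


def power_mean(values, p, eps=1e-6):
    """Generalized power mean M_p(x) = (mean(x_i ** p)) ** (1 / p)."""
    if not values:
        raise ValueError("values must be non-empty")
    # Clamp away from zero so negative exponents and logarithms stay finite.
    clamped = [max(v, eps) for v in values]
    n = len(clamped)
    if abs(p) < 1e-12:
        # The limit p -> 0 is the geometric mean.
        return math.exp(sum(math.log(v) for v in clamped) / n)
    return (sum(v ** p for v in clamped) / n) ** (1.0 / p)


def aggregate(verdicts, t):
    """Aggregate verdict labels into one score at temperature t."""
    scores = [VERDICT_SCORES[v] for v in verdicts]
    return power_mean(scores, temperature_to_exponent(t))


if __name__ == "__main__":
    verdicts = ["fully", "fully", "partial"]
    for t in (0.1, 0.5, 1.0):
        print(f"T={t}: {aggregate(verdicts, t):.3f}")
```

Because the verdicts are produced once and only the aggregation step depends on T, re-scoring at a different temperature is pure arithmetic, which matches the abstract's claim that adjusting T needs no further LLM calls. Under the assumed mapping, a single weak verdict pulls the score down sharply at low T and only mildly at high T.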
