CL · AI · Dec 19, 2025

AutoMetrics: Approximate Human Judgements with Automatically Generated Evaluators

Georgia Tech
arXiv:2512.17267v1 · 1 citation · h-index: 15
Originality: Incremental advance
AI Analysis

This work addresses the problem of scarce or slow human feedback when evaluating AI applications, particularly in research and prototyping, and offers an incremental improvement in metric synthesis.

The paper tackles the challenge of evaluating user-facing AI applications in open-ended domains by introducing AutoMetrics, a framework that synthesizes evaluation metrics under low-data constraints, improving Kendall correlation with human ratings by up to 33.4% over existing methods while requiring fewer than 100 feedback points.

Evaluating user-facing AI applications remains a central challenge, especially in open-ended domains such as travel planning, clinical note generation, or dialogue. The gold standard is user feedback (e.g., thumbs up/down) or behavioral signals (e.g., retention), but these are often scarce in prototypes and research projects, or too slow to use for system optimization. We present AutoMetrics, a framework for synthesizing evaluation metrics under low-data constraints. AutoMetrics combines retrieval from MetricBank, a collection of 48 metrics we curate, with automatically generated LLM-as-a-Judge criteria informed by lightweight human feedback. These metrics are composed via regression to maximize correlation with human signal. AutoMetrics takes you from expensive measures to interpretable automatic metrics. Across 5 diverse tasks, AutoMetrics improves Kendall correlation with human ratings by up to 33.4% over LLM-as-a-Judge while requiring fewer than 100 feedback points. We show that AutoMetrics can be used as a proxy reward with an effect equal to that of a verifiable reward. We release the full AutoMetrics toolkit and MetricBank to accelerate adaptive evaluation of LLM applications.
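To make the "composed via regression" step concrete, here is a minimal sketch in plain Python. It is not the AutoMetrics toolkit's API: the function names (kendall_tau, fit_linear, predict), the choice of ordinary least squares, and the toy scores are all assumptions made for illustration. The abstract does not say which regression method the authors use. The sketch fits weights over two hypothetical candidate metrics against a handful of human ratings, then compares Kendall correlation for a single metric and for the composite.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / total pairs. Ties count as neither."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


def fit_linear(rows, targets):
    """Ordinary least squares with an intercept (an assumed choice of regression).

    Solves the normal equations (X^T X) w = X^T y by Gauss-Jordan elimination.
    """
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * t for x, t in zip(X, targets))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(weights, row):
    """Composite score: intercept plus weighted sum of the metric scores."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], row))


# Toy data (synthetic, not from the paper): each row holds two hypothetical
# candidate metric scores for one system output.
metric_scores = [
    (0.1, 0.9), (0.4, 0.2), (0.3, 0.8),
    (0.7, 0.1), (0.6, 0.6), (0.9, 0.3),
]
# Synthetic "human ratings", constructed as 2 * metric_1 + metric_2 so the
# example has a known answer.
human = [2 * m1 + m2 for m1, m2 in metric_scores]

weights = fit_linear(metric_scores, human)
composite = [predict(weights, row) for row in metric_scores]

tau_single = kendall_tau([m1 for m1, _ in metric_scores], human)
tau_composite = kendall_tau(composite, human)
print("tau, first metric alone:", tau_single)
print("tau, regression composite:", tau_composite)
```

Because the toy ratings are an exact linear function of the two metrics, the composite ranks the outputs as the ratings do, while the first metric alone does not. With real feedback, the fit would be noisy and should be checked on held-out ratings rather than on the points used for fitting.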

