QQJ: Quantifying Qualitative Judgment for Scalable and Human-Aligned Evaluation of Generative AI

Marjan Veysi, Pirooz Shamsinejadbabaki, Mohammad Zare, Mohammad Sabouri

arXiv:2605.1738216.7

Predicted impact top 93% in AI · last 90 daysOriginality Incremental advance

AI Analysis

For researchers and practitioners evaluating generative AI, QQJ provides a practical, interpretable, and scalable method that improves human alignment and diagnostic capability over existing automatic and LLM-based evaluators.

QQJ introduces a scalable evaluation framework that bridges human judgment and automated assessment by anchoring in expert-designed rubrics and calibrating LLM evaluators with small annotation sets, achieving stronger alignment with human judgment than traditional metrics and unconstrained LLM evaluators on text and image generation tasks.

The rapid progress of generative artificial intelligence has exposed fundamental limitations in existing evaluation methodologies, particularly for open-ended, creative, and human-facing tasks. Traditional automatic metrics rely on surface-level statistical similarity and often fail to reflect human perceptions of quality, while purely human evaluation, although reliable, is costly, subjective, and difficult to scale. Recent approaches using large language models as evaluators offer improved scalability but frequently lack explicit grounding in human-defined evaluation principles, leading to bias and inconsistency. In this paper, we introduce Quantifying Qualitative Judgment (QQJ), a scalable and human-centric evaluation framework that explicitly bridges the gap between human judgment and automated assessment. QQJ separates the definition of quality from its execution by anchoring evaluation in expert-designed, multi-dimensional rubrics and calibrating large language model evaluators to align with expert reasoning using a small, high-quality annotation set. This design enables consistent, interpretable, and scalable evaluation across diverse generative tasks and modalities. Extensive experiments on text and image generation demonstrate that QQJ achieves substantially stronger alignment with human judgment than traditional automatic metrics and unconstrained LLM-based evaluators. Moreover, QQJ exhibits improved stability across repeated evaluations and superior diagnostic capability in identifying critical failure modes such as hallucination and intent mismatch. These results indicate that structured qualitative judgment can be operationalized at scale without sacrificing interpretability or human alignment, positioning QQJ as a practical foundation for reliable evaluation of modern generative AI systems.

View on arXiv PDF

Similar