CL AIMar 11

Criterion Validity of LLM-as-Judge for Business Outcomes in Conversational Commerce

Liang Chen, Qi Liu, Wenhuan Lin, Feng Liang

arXiv:2604.0002275.3

Predicted impact top 82% in CL · last 90 daysOriginality Incremental advance

AI Analysis

This addresses the problem of ensuring dialogue evaluation metrics align with real-world business outcomes for AI developers and platforms, though it is incremental in refining existing evaluation practices.

The study tested the criterion validity of a multi-dimensional rubric for evaluating conversational AI on a Chinese matchmaking platform, finding that only specific dimensions like Need Elicitation and Pacing Strategy were significantly associated with business conversion, while equal-weighted composites diluted performance.

Multi-dimensional rubric-based dialogue evaluation is widely used to assess conversational AI, yet its criterion validity -- whether quality scores are associated with the downstream outcomes they are meant to serve -- remains largely untested. We address this gap through a two-phase study on a major Chinese matchmaking platform, testing a 7-dimension evaluation rubric (implemented via LLM-as-Judge) against verified business conversion. Our findings concern rubric design and weighting, not LLM scoring accuracy: any judge using the same rubric would face the same structural issue. The core finding is dimension-level heterogeneity: in Phase 2 (n=60 human conversations, stratified sample, verified labels), Need Elicitation (D1: rho=0.368, p=0.004) and Pacing Strategy (D3: rho=0.354, p=0.006) are significantly associated with conversion after Bonferroni correction, while Contextual Memory (D5: rho=0.018, n.s.) shows no detectable association. This heterogeneity causes the equal-weighted composite (rho=0.272) to underperform its best dimensions -- a composite dilution effect that conversion-informed reweighting partially corrects (rho=0.351). Logistic regression controlling for conversation length confirms D3's association strengthens (OR=3.18, p=0.006), ruling out a length confound. An initial pilot (n=14) mixing human and AI conversations had produced a misleading "evaluation-outcome paradox," which Phase 2 revealed as an agent-type confound artifact. Behavioral analysis of 130 conversations through a Trust-Funnel framework identifies a candidate mechanism: AI agents execute sales behaviors without building user trust. We operationalize these findings in a three-layer evaluation architecture and advocate criterion validity testing as standard practice in applied dialogue evaluation.

View on arXiv PDF

Similar