LG
Nov 4, 2025

Unsupervised Evaluation of Multi-Turn Objective-Driven Interactions

arXiv:2511.03047v1
h-index: 2
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of scalable, reliable evaluation of enterprise AI systems in settings where human annotation is impractical.

The paper tackles the evaluation of large language models in objective-driven interactions by introducing what the authors describe as the first set of unsupervised metrics. The metrics leverage statistical properties of unlabeled interaction data and fine-tuned LLMs, and are validated on open-domain and task-specific interaction data.

Large language models (LLMs) have seen increasing popularity in enterprise applications where AI agents and humans engage in objective-driven interactions. However, these systems are difficult to evaluate: data may be complex and unlabeled; human annotation is often impractical at scale; custom metrics can monitor for specific errors, but not previously-undetected ones; and LLM judges can produce unreliable results. We introduce the first set of unsupervised metrics for objective-driven interactions, leveraging statistical properties of unlabeled interaction data and using fine-tuned LLMs to adapt to distributional shifts. We develop metrics for labeling user goals, measuring goal completion, and quantifying LLM uncertainty without grounding evaluations in human-generated ideal responses. Our approach is validated on open-domain and task-specific interaction data.

