AI · HC · May 15, 2025

Evaluations at Work: Measuring the Capabilities of GenAI in Use

arXiv:2505.10742v2 · h-index: 16
Originality: Incremental advance
AI Analysis

This work addresses the need for better evaluation of human-AI collaboration in messy, multi-turn settings. It offers a methodological framework and insights for improving AI-augmented work processes, though its advance over existing evaluation methods is incremental.

The authors evaluate generative AI in real-world collaborative tasks with a framework that decomposes tasks into subtasks and tracks both LLM performance and user strategies. They find that integrating LLM content improves output quality, but that the benefit is moderated by factors such as response incoherence and the distance of the information from users' existing knowledge.

Current AI benchmarks miss the messy, multi-turn nature of human-AI collaboration. We present an evaluation framework that decomposes real-world tasks into interdependent subtasks, letting us track both LLM performance and users' strategies across a dialogue. Complementing this framework, we develop a suite of metrics, including a composite usage derived from semantic similarity, word overlap, and numerical matches; structural coherence; intra-turn diversity; and a novel measure of the "information frontier" reflecting the alignment between AI outputs and users' working knowledge. We demonstrate our methodology in a financial valuation task that mirrors real-world complexity. Our empirical findings reveal that while greater integration of LLM-generated content generally enhances output quality, its benefits are moderated by factors such as response incoherence, excessive subtask diversity, and the distance of provided information from users' existing knowledge. These results suggest that proactive dialogue strategies designed to inject novelty may inadvertently undermine task performance. Our work thus advances a more holistic evaluation of human-AI collaboration, offering both a robust methodological framework and actionable insights for developing more effective AI-augmented work processes.
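To make the "composite usage" idea concrete, here is a minimal Python sketch of how semantic similarity, word overlap, and numerical matches might be combined into one score. It is an illustration under stated assumptions, not the authors' implementation: the function names, the equal weights, and the directional definitions of overlap are all hypothetical, and a bag-of-words cosine stands in for the embedding-based semantic similarity a real pipeline would likely use.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z]+")
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def tokens(text):
    """Lowercased alphabetic tokens."""
    return TOKEN_RE.findall(text.lower())


def cosine_bow(a, b):
    """Bag-of-words cosine similarity.

    ASSUMPTION: a crude stand-in for semantic similarity; a real
    pipeline would more plausibly compare sentence embeddings.
    """
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def word_overlap(llm_text, user_text):
    """Share of the LLM's distinct words that reappear in the user's text."""
    a, b = set(tokens(llm_text)), set(tokens(user_text))
    return len(a & b) / len(a) if a else 0.0


def numeric_match(llm_text, user_text):
    """Share of the LLM's distinct numbers that reappear in the user's text."""
    a = set(NUMBER_RE.findall(llm_text))
    b = set(NUMBER_RE.findall(user_text))
    return len(a & b) / len(a) if a else 0.0


def composite_usage(llm_text, user_text, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted mean of the three signals.

    ASSUMPTION: equal weights are illustrative only.
    """
    parts = (
        cosine_bow(llm_text, user_text),
        word_overlap(llm_text, user_text),
        numeric_match(llm_text, user_text),
    )
    return sum(w * p for w, p in zip(weights, parts))
```

Calling `composite_usage(llm_response, user_answer)` on one dialogue turn yields a score between 0 (nothing from the response was reused) and roughly 1 (the response was reused wholesale). The numerical-match term is kept separate because a valuation answer can reuse a figure without reusing the surrounding wording.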

