HCAI | Mar 24, 2025

SPHERE: An Evaluation Card for Human-AI Systems

arXiv:2504.07971v1 | 13 citations | h-index: 12 | ACL
Originality: Synthesis-oriented
AI Analysis

This paper addresses inconsistent evaluation standards for human-AI systems, a problem that chiefly affects researchers and practitioners. Its contribution is incremental, since it builds on existing evaluation frameworks.

The paper tackles the challenge of evaluating human-AI interaction systems by introducing SPHERE, an evaluation card with five dimensions, and applies it to review 39 systems to outline current practices and provide recommendations for improvement.

In the era of Large Language Models (LLMs), establishing effective evaluation methods and standards for diverse human-AI interaction systems is increasingly challenging. To encourage more transparent documentation and facilitate discussion on human-AI system evaluation design options, we present an evaluation card SPHERE, which encompasses five key dimensions: 1) What is being evaluated?; 2) How is the evaluation conducted?; 3) Who is participating in the evaluation?; 4) When is evaluation conducted?; 5) How is evaluation validated? We conduct a review of 39 human-AI systems using SPHERE, outlining current evaluation practices and areas for improvement. We provide three recommendations for improving the validity and rigor of evaluation practices.
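As a rough illustration of how the five dimensions could be recorded for one system, the sketch below models an evaluation card as a small Python data structure. The class name `SphereCard`, its field names, the helper `missing_dimensions`, and the sample answers are all hypothetical and chosen for this example; they are not taken from the paper or its repository.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass
class SphereCard:
    """Hypothetical record holding one free-text answer per SPHERE dimension."""

    what: str        # 1) What is being evaluated?
    how: str         # 2) How is the evaluation conducted?
    who: str         # 3) Who is participating in the evaluation?
    when: str        # 4) When is evaluation conducted?
    validation: str  # 5) How is evaluation validated?


def missing_dimensions(card: SphereCard) -> list[str]:
    """Return the names of the dimensions left blank on the card."""
    return [f.name for f in fields(card) if not getattr(card, f.name).strip()]


# Illustrative answers only; they do not describe any system reviewed in the paper.
card = SphereCard(
    what="example: quality of co-written text",
    how="example: user study with questionnaires",
    who="example: recruited participants",
    when="",
    validation="",
)

print(missing_dimensions(card))  # ['when', 'validation']
```

Listing the blank dimensions is one simple way to surface undocumented evaluation choices, in the spirit of the transparent documentation the abstract calls for.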

Code Implementations: 1 repo
