CL
May 24, 2023

Psychological Metrics for Dialog System Evaluation

arXiv:2305.14757v2
7 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for more human-like evaluation tools for dialog systems, offering interpretable metrics that can improve system comparisons and development, though it is incremental in applying established psychology to a known bottleneck.

The authors tackled the problem of evaluating dialog systems by introducing psychologically-grounded metrics, such as emotional entropy and empathy, and demonstrated that these metrics provide novel information uncorrelated with traditional metrics, leading to increased accuracy in predicting crowd-sourced judgments.

We present metrics for evaluating dialog systems through a psychologically-grounded "human" lens in which conversational agents express a diversity of both states (e.g., emotion) and traits (e.g., personality), just as people do. We present five interpretable metrics from established psychology that are fundamental to human communication and relationships: emotional entropy, linguistic style matching, emotion matching, agreeableness, and empathy. These metrics can be applied (1) across dialogs and (2) on turns within dialogs. The psychological metrics are compared against seven state-of-the-art traditional metrics (e.g., BARTScore and BLEURT) on seven standard dialog system data sets. We also introduce a novel data set, the Three Bot Dialog Evaluation Corpus, which consists of annotated conversations from ChatGPT, GPT-3, and BlenderBot. We demonstrate that our proposed metrics offer novel information: they are uncorrelated with traditional metrics, can be used to meaningfully compare dialog systems, and lead to increased accuracy (beyond existing traditional metrics) in predicting crowd-sourced dialog judgements. The interpretability and unique signal of our psychological metrics make them a valuable tool for evaluating and improving dialog systems.
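To make "emotional entropy" concrete, here is a minimal Python sketch. It assumes the metric can be read as the Shannon entropy of the emotion labels an agent expresses across its turns; the abstract does not give the exact definition, so this is an illustration and not the authors' implementation. The function name `emotional_entropy` and the example labels are hypothetical, and the labels are assumed to come from some upstream emotion classifier that is not shown.

```python
import math
from collections import Counter


def emotional_entropy(emotion_labels):
    """Shannon entropy (in bits) of the empirical distribution of emotion labels.

    Assumption: one categorical emotion label per turn. An agent that always
    expresses the same emotion scores 0; a uniform spread over k emotions
    scores log2(k).
    """
    if not emotion_labels:
        return 0.0
    counts = Counter(emotion_labels)
    total = len(emotion_labels)
    # Sum p * log2(p) over the observed labels, then negate.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical per-turn labels for two agents.
flat_agent = ["joy", "joy", "joy", "joy"]
varied_agent = ["joy", "sadness", "anger", "fear"]

print(emotional_entropy(flat_agent))    # no emotional diversity
print(emotional_entropy(varied_agent))  # maximal diversity over four labels
```

Under this reading, a higher score means the agent expresses a wider range of emotions. The same function can be applied to all turns in a corpus or to the turns of a single dialog.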

