LG / HC / ML, Nov 18, 2019

Multi-domain Conversation Quality Evaluation via User Satisfaction Estimation

arXiv:1911.08567v1, 24 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for non-intrusive, generalizable metrics to optimize dialogue management in conversational AI systems, though it is incremental in nature.

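The paper tackles the problem of evaluating dialogue quality across multiple domains by developing a new annotation scheme and new feature sets. Its Response Quality ratings reach a 0.76 correlation with explicit turn-level user ratings, and adding predicted turn-level ratings as features yields a 16% relative improvement (68% to 79%) in binary dialogue-level satisfaction prediction accuracy.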

An automated metric to evaluate dialogue quality is vital for optimizing data-driven dialogue management. The common approach of relying on explicit user feedback during a conversation is intrusive, and the feedback it yields is sparse. Current models to estimate user satisfaction use limited feature sets and employ annotation schemes with limited generalizability to conversations spanning multiple domains. To address these gaps, we created a new Response Quality annotation scheme, introduced five new domain-independent feature sets, and experimented with six machine learning models to estimate User Satisfaction at both the turn and dialogue level. Response Quality ratings achieved a significantly high correlation (0.76) with explicit turn-level user ratings. Using the new feature sets we introduced, the Gradient Boosting Regression model achieved the best rating prediction performance (on a 1-5 scale) on 26 seen domains (linear correlation ~0.79) and one new multi-turn domain (linear correlation 0.67). We observed a 16% relative improvement (68% -> 79%) in binary ("satisfactory/dissatisfactory") class prediction accuracy of a domain-independent dialogue-level satisfaction estimation model after including predicted turn-level satisfaction ratings as features.
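The two kinds of figures quoted in the abstract, a linear correlation and a relative improvement, can be illustrated with a minimal sketch. This is not the authors' code: the function names `relative_improvement` and `pearson` are hypothetical, and the rating lists are made-up illustrative values, not data from the paper. Only the 68% and 79% accuracies come from the abstract.

```python
import math


def relative_improvement(baseline: float, improved: float) -> float:
    """Return the gain of `improved` over `baseline` as a fraction of `baseline`."""
    return (improved - baseline) / baseline


def pearson(xs, ys) -> float:
    """Pearson linear correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Binary accuracy figures quoted in the abstract: 68% -> 79%.
gain = relative_improvement(0.68, 0.79)
print(f"relative improvement: {gain:.1%}")  # 16.2%

# Hypothetical 1-5 ratings (NOT from the paper), only to show the call shape.
predicted_ratings = [1.5, 2.0, 3.5, 4.0, 4.5]
explicit_ratings = [1, 2, 3, 5, 4]
print(f"linear correlation: {pearson(predicted_ratings, explicit_ratings):.2f}")
```

The relative figure divides the 11-point absolute gain by the 68% baseline, which is why the abstract reports 16% rather than 11%.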

