CL
May 24, 2021

Towards Standard Criteria for human evaluation of Chatbots: A Survey

arXiv:2105.11197v1
18 citations
Originality: Synthesis-oriented
AI Analysis

This work aims to improve the consistency and comparability of chatbot evaluations for researchers and developers, though it is incremental as it builds on existing survey methods.

The paper addresses the lack of standardized criteria in human evaluation of chatbots, which leads to reliability and replication issues, by surveying 105 papers and proposing five standard criteria with precise definitions.

Human evaluation is becoming a necessity for testing the performance of Chatbots. However, off-the-shelf settings suffer from severe reliability and replication issues, partly because of the extremely high diversity of criteria. It is high time to come up with standard criteria and exact definitions. To this end, we conduct a thorough investigation of 105 papers involving human evaluation of Chatbots. Based on this investigation, we propose five standard criteria along with precise definitions.

