A Step Towards Mixture of Grader: Statistical Analysis of Existing Automatic Evaluation Metrics
This work addresses unreliable automated evaluation of QA systems, a problem that matters to both researchers and developers. Its contribution is incremental: it primarily analyzes the limitations of existing metrics and suggests a potential solution without implementing it.
The study analyzes existing automatic evaluation metrics for question-answering (QA) systems and finds that, although these metrics correlate highly with one another with respect to question type, no single metric adequately estimates human-like evaluation scores.
The explosion of open-source models and Question-Answering (QA) datasets emphasizes the importance of automated QA evaluation. We studied the statistics of existing evaluation metrics to better understand their limitations. By measuring the correlation coefficient of each evaluation metric with respect to a human-like evaluation score, we observed the following: (1) existing metrics are highly correlated with one another with respect to question type (e.g., single word, single phrase, etc.), and (2) no single metric can adequately estimate the human-like evaluation. As a potential solution, we discuss how a Mixture Of Grader could improve the quality of automatic QA evaluators.
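The correlation analysis described above can be illustrated with a minimal sketch. It assumes Pearson correlation, which the abstract does not specify, and it uses hypothetical metric names (`exact_match`, `f1`), a hypothetical record layout, and made-up toy scores. None of these come from the paper, so treat it as one possible way to set up such an analysis rather than the authors' procedure.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined when one of the sequences is constant.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def correlations_by_question_type(records, metric_names, target="human_like"):
    """Correlate each metric with the target score, separately per question type."""
    by_type = {}
    for rec in records:
        by_type.setdefault(rec["question_type"], []).append(rec)

    result = {}
    for qtype, recs in by_type.items():
        target_scores = [r[target] for r in recs]
        result[qtype] = {
            m: pearson([r[m] for r in recs], target_scores) for m in metric_names
        }
    return result


# Toy records: all field names and scores are made up for illustration only.
records = [
    {"question_type": "single_word", "exact_match": 1.0, "f1": 1.0, "human_like": 1.0},
    {"question_type": "single_word", "exact_match": 0.0, "f1": 0.0, "human_like": 0.0},
    {"question_type": "single_word", "exact_match": 0.0, "f1": 0.5, "human_like": 1.0},
    {"question_type": "single_phrase", "exact_match": 0.0, "f1": 0.8, "human_like": 1.0},
    {"question_type": "single_phrase", "exact_match": 0.0, "f1": 0.4, "human_like": 0.5},
    {"question_type": "single_phrase", "exact_match": 1.0, "f1": 1.0, "human_like": 1.0},
]

table = correlations_by_question_type(records, ["exact_match", "f1"])
for qtype, scores in table.items():
    print(qtype, scores)
```

Grouping by question type before correlating mirrors the abstract's observation that metric behavior depends on the question type. A rank-based coefficient such as Spearman's could be substituted for `pearson` if the relationship is expected to be monotonic but not linear.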