A Gold Standard Methodology for Evaluating Accuracy in Data-To-Text Systems
This work addresses the need for reliable accuracy assessment in natural language generation, particularly for data-to-text systems. The contribution appears incremental: it refines evaluation methodology rather than presenting a breakthrough.
The authors propose a gold-standard human evaluation methodology for assessing accuracy in data-to-text systems, apply it to computer-generated basketball summaries, and use the resulting annotations to validate automated metrics.
Most Natural Language Generation systems need to produce accurate texts. We propose a methodology for high-quality human evaluation of the accuracy of generated texts, which is intended to serve as a gold standard for accuracy evaluations of data-to-text systems. We use our methodology to evaluate the accuracy of computer-generated basketball summaries. We then show how our gold-standard evaluation can be used to validate automated metrics.
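
The abstract does not say how an automated metric is validated against the gold standard. One common approach, sketched here as an assumption and not as the authors' procedure, is to treat the human-annotated errors as ground truth and score the errors a metric flags by precision and recall. The encoding of an error as a (document id, token position) pair, the function name `precision_recall`, and the sample annotations are all hypothetical and made up for illustration.

```python
from typing import Set, Tuple

# Hypothetical encoding of one annotated error: (document id, token position).
Error = Tuple[str, int]


def precision_recall(gold: Set[Error], flagged: Set[Error]) -> Tuple[float, float]:
    """Score metric-flagged errors against gold-standard human annotations."""
    true_positives = len(gold & flagged)
    # Guard against empty sets so the function never divides by zero.
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Illustrative, made-up annotations; not data from the paper.
gold_errors = {("game_01", 4), ("game_01", 17), ("game_02", 9)}
metric_errors = {("game_01", 4), ("game_02", 9), ("game_02", 30)}

precision, recall = precision_recall(gold_errors, metric_errors)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Exact-position matching is the simplest possible choice. A real comparison would likely need a rule for partially overlapping error spans and for errors that the annotators and the metric categorize differently.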