Human vs Automatic Metrics: on the Importance of Correlation Design
This paper addresses a methodological issue for researchers in natural language generation evaluation, but its contribution is incremental, as it builds on existing correlation approaches.
The paper investigates the inconsistency in correlation results between automatic evaluation metrics and human judgments in natural language generation, depending on whether system-level or sentence-level analysis is used.
This paper discusses two existing approaches to correlation analysis between automatic evaluation metrics and human scores in natural language generation. Our experiments show that the correlation between automatic scores and human judgments is inconsistent: the results differ depending on whether a system-level or a sentence-level analysis is used.
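To make the distinction concrete, the following minimal sketch contrasts the two designs on toy data. Everything in it is an assumption made for illustration: the scores are invented, the names `pearson`, `system_level`, and `sentence_level` are hypothetical, Pearson's r stands in for whichever coefficient the paper uses, and pooling all sentences is only one possible reading of "sentence-level" analysis. It is not the paper's experimental setup.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented toy data (NOT from the paper): for each hypothetical system,
# a list of (automatic metric score, human score) pairs, one per sentence.
scores = {
    "sys_a": [(0.1, 3.0), (0.2, 2.0), (0.3, 1.0)],
    "sys_b": [(0.4, 4.0), (0.5, 3.0), (0.6, 2.0)],
    "sys_c": [(0.7, 5.0), (0.8, 4.0), (0.9, 3.0)],
}


def system_level(scores):
    """Average both scores per system, then correlate the system averages."""
    metric_means = [mean(m for m, _ in pairs) for pairs in scores.values()]
    human_means = [mean(h for _, h in pairs) for pairs in scores.values()]
    return pearson(metric_means, human_means)


def sentence_level(scores):
    """Correlate the raw per-sentence scores, pooled over all systems."""
    metric = [m for pairs in scores.values() for m, _ in pairs]
    human = [h for pairs in scores.values() for _, h in pairs]
    return pearson(metric, human)


print(f"system-level r:   {system_level(scores):.3f}")
print(f"sentence-level r: {sentence_level(scores):.3f}")
```

On this toy data the system-level correlation is close to 1.0 while the pooled sentence-level correlation is only about 0.45. The metric ranks the three systems in the same order as the human scores, yet within each system it moves in the opposite direction from them, which is the kind of disagreement between the two designs that the abstract describes.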