CL, Jun 3, 2019

Handling Divergent Reference Texts when Evaluating Table-to-Text Generation

Bhuwan Dhingra, Manaal Faruqui, Ankur Parikh, Ming-Wei Chang, Dipanjan Das, William W. Cohen

arXiv:1906.01081v1 32.01154 citations

Originality: Incremental advance

AI Analysis

This work addresses unreliable evaluation in table-to-text generation, a concern for both researchers and practitioners, though it is incremental in that it builds on prior work such as Wiseman et al. (2017).

The paper tackles the problem of evaluating table-to-text generation when reference texts diverge from the source data, showing that existing metrics such as BLEU and ROUGE correlate poorly with human judgments in that setting. The authors propose a new metric, PARENT, which aligns n-grams to the data before computing precision and recall, and demonstrate that it correlates better with human judgments on WikiBio and is also applicable to WebNLG data.

Automatically constructed datasets for generating text from semi-structured data (tables), such as WikiBio, often contain reference texts that diverge from the information in the corresponding semi-structured data. We show that metrics which rely solely on the reference texts, such as BLEU and ROUGE, correlate poorly with human judgments when those references diverge. We propose a new metric, PARENT, which aligns n-grams from the reference and generated texts to the semi-structured data before computing their precision and recall. Through a large-scale human evaluation study of table-to-text models for WikiBio, we show that PARENT correlates with human judgments better than existing text generation metrics. We also adapt and evaluate the information-extraction-based evaluation proposed by Wiseman et al. (2017), and show that PARENT has comparable correlation to it, while being easier to use. Finally, using data from the WebNLG challenge, we show that PARENT is also applicable when the reference texts are elicited from humans.
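
To make the idea of aligning n-grams to the table before scoring more concrete, here is a minimal sketch in Python. It is not the PARENT metric as defined in the paper: it assumes a binary "all tokens appear in the table" test in place of the paper's alignment, works at a single n-gram order, and uses hypothetical names (`table_aware_precision_recall`, `supported_by_table`) chosen only for illustration. It shows the general effect: generated n-grams are credited if either the reference or the table supports them, and reference n-grams count toward recall only if the table supports them.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def supported_by_table(ngram, table_tokens):
    """Assumed, simplified alignment: every token must occur in some table value."""
    return all(tok in table_tokens for tok in ngram)


def table_aware_precision_recall(generated, reference, table, n=1):
    """Illustrative table-aware precision and recall (not the official PARENT).

    table is a list of (attribute, value) string pairs.
    """
    gen_counts = Counter(ngrams(generated.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    table_tokens = {tok for _, value in table for tok in value.lower().split()}

    # Precision: a generated n-gram is correct if the table supports it,
    # otherwise it only gets credit for (clipped) overlap with the reference.
    total_gen = sum(gen_counts.values())
    correct = sum(
        count if supported_by_table(g, table_tokens) else min(count, ref_counts[g])
        for g, count in gen_counts.items()
    )
    precision = correct / total_gen if total_gen else 0.0

    # Recall: only reference n-grams supported by the table are expected,
    # so divergent reference content does not penalize the generated text.
    expected = {g: c for g, c in ref_counts.items()
                if supported_by_table(g, table_tokens)}
    total_expected = sum(expected.values())
    matched = sum(min(c, gen_counts[g]) for g, c in expected.items())
    recall = matched / total_expected if total_expected else 0.0

    return precision, recall


if __name__ == "__main__":
    # Made-up toy record; the reference mentions details absent from the table.
    table = [("name", "michael dahlquist"), ("occupation", "drummer")]
    reference = "michael dahlquist was a drummer in the seattle band silkworm"
    generated = "michael dahlquist was a drummer"
    print(table_aware_precision_recall(generated, reference, table))
```

In this toy example the reference contains details that the table does not, which is the divergence the abstract describes. A purely reference-based recall would penalize the generated sentence for omitting them, whereas the table-aware recall in this sketch ignores reference n-grams that the table cannot support.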
