CL · AI · Dec 2, 2021

InfoLM: A New Metric to Evaluate Summarization & Data2Text Generation

arXiv:2112.01589v3 · 52 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for more robust and efficient automatic evaluation metrics in NLP and is aimed at researchers and practitioners working on summarization and data2text generation. Its contribution is incremental: it builds on existing string-based metrics and pre-trained models.

The authors address the evaluation of natural language generation systems by introducing InfoLM, a family of untrained metrics built on a pre-trained masked language model. Using direct assessment, they report statistically significant improvements and gains of over 10 correlation points in many configurations on both summarization and data2text generation.

Assessing the quality of natural language generation systems through human annotation is very expensive. Additionally, human annotation campaigns are time-consuming and include non-reusable human labour. In practice, researchers rely on automatic metrics as a proxy of quality. In the last decade, many string-based metrics (e.g., BLEU) have been introduced. However, such metrics usually rely on exact matches and thus do not robustly handle synonyms. In this paper, we introduce InfoLM, a family of untrained metrics that can be viewed as a string-based metric that addresses the aforementioned flaws thanks to a pre-trained masked language model. This family of metrics also makes use of information measures, allowing the adaptation of InfoLM to various evaluation criteria. Using direct assessment, we demonstrate that InfoLM achieves statistically significant improvement and over 10 points of correlation gains in many configurations on both summarization and data2text generation.
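
To make the contrast in the abstract concrete, here is a minimal sketch, not the authors' implementation. It compares an exact-match score (clipped unigram precision, a BLEU-style building block) with an information measure (KL divergence, chosen here only as one familiar example) applied to two distributions over a vocabulary. The toy distributions are hand-written and hypothetical: they stand in for what a pre-trained masked language model might predict, and how InfoLM actually builds and aggregates such distributions is not described in this summary. The function names are illustrative assumptions.

```python
import math
from collections import Counter


def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped exact-match unigram precision (a BLEU-style building block)."""
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    # A candidate token only counts if the exact same string is in the reference.
    matched = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    return matched / len(cand_tokens)


def kl_divergence(p: dict, q: dict, eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions given as token -> probability."""
    total = 0.0
    for tok, p_tok in p.items():
        if p_tok > 0.0:
            # eps avoids log(0) when q assigns no mass to a token.
            total += p_tok * math.log((p_tok + eps) / (q.get(tok, 0.0) + eps))
    return total


# Exact matching penalizes a synonym: "film" does not match "movie".
exact = unigram_precision("a great film", "a great movie")

# Hypothetical, hand-written distributions standing in for masked-LM output.
reference_dist = {"movie": 0.5, "film": 0.4, "car": 0.1}
paraphrase_dist = {"movie": 0.4, "film": 0.5, "car": 0.1}
unrelated_dist = {"movie": 0.05, "film": 0.05, "car": 0.9}

close = kl_divergence(reference_dist, paraphrase_dist)
far = kl_divergence(reference_dist, unrelated_dist)
print(f"exact-match precision: {exact:.3f}")
print(f"KL to paraphrase: {close:.4f}, KL to unrelated: {far:.4f}")
```

In this toy setup the paraphrase loses a third of its exact-match score for swapping one synonym, while the divergence between the reference and paraphrase distributions stays small compared with the divergence to the unrelated distribution. Swapping in a different information measure changes what the score rewards, which is the kind of adaptation to evaluation criteria the abstract alludes to.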

Code Implementations: 1 repo
