CL CR LG · Aug 15, 2022

MENLI: Robust Evaluation Metrics from Natural Language Inference

arXiv:2208.07316v517.7143 citations · h-index: 30 · Has Code

Originality: Incremental advance

AI Analysis

This paper addresses the unreliable evaluation of generated text, offering researchers and practitioners a more robust alternative to current metrics.

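The paper tackles the vulnerability of BERT-based evaluation metrics to adversarial attacks by proposing metrics based on Natural Language Inference (NLI). The NLI-based metrics are more robust to the attacks and outperform existing summarization metrics on standard benchmarks, but perform below SOTA MT metrics. Combining them with existing metrics yields both higher adversarial robustness (15%-30%) and higher quality on standard benchmarks (+5% to 30%).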

Recently proposed BERT-based evaluation metrics for text generation perform well on standard benchmarks but are vulnerable to adversarial attacks, e.g., relating to information correctness. We argue that this stems (in part) from the fact that they are models of semantic similarity. In contrast, we develop evaluation metrics based on Natural Language Inference (NLI), which we deem a more appropriate modeling. We design a preference-based adversarial attack framework and show that our NLI-based metrics are much more robust to the attacks than the recent BERT-based metrics. On standard benchmarks, our NLI-based metrics outperform existing summarization metrics, but perform below SOTA MT metrics. However, when combining existing metrics with our NLI metrics, we obtain both higher adversarial robustness (15%-30%) and higher quality metrics as measured on standard benchmarks (+5% to 30%).
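
The two ideas in the abstract, a preference-based robustness test and the combination of an NLI score with an existing metric, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not give the combination formula, so a plain weighted average is used; `toy_overlap` is a token-overlap stand-in for a similarity metric, and `toy_number_check` is a hand-written placeholder for an NLI model, not a real one.

```python
from typing import Callable, Sequence, Tuple

# A metric maps (reference, candidate) to a score in [0, 1]; higher is better.
Metric = Callable[[str, str], float]


def toy_overlap(reference: str, candidate: str) -> float:
    """Jaccard token overlap: a crude stand-in for a similarity-based metric."""
    ref, cand = set(reference.lower().split()), set(candidate.lower().split())
    return len(ref & cand) / len(ref | cand) if ref | cand else 0.0


def toy_number_check(reference: str, candidate: str) -> float:
    """Placeholder for an NLI score (hypothetical): 1.0 if both texts
    mention the same numbers, else 0.0. A real setup would use an NLI model."""
    def numbers(text: str) -> set:
        return {tok for tok in text.split() if tok.isdigit()}
    return 1.0 if numbers(reference) == numbers(candidate) else 0.0


def combine(nli: Metric, base: Metric, weight: float = 0.5) -> Metric:
    """Weighted average of two metrics (an assumed combination rule)."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return lambda ref, cand: weight * nli(ref, cand) + (1.0 - weight) * base(ref, cand)


def preference_accuracy(metric: Metric,
                        triples: Sequence[Tuple[str, str, str]]) -> float:
    """Fraction of (reference, paraphrase, adversarial) triples in which the
    metric scores the faithful paraphrase strictly above the adversarial text."""
    if not triples:
        raise ValueError("need at least one triple")
    wins = sum(metric(ref, good) > metric(ref, bad) for ref, good, bad in triples)
    return wins / len(triples)


# One illustrative triple: the adversarial candidate changes a number
# (an information-correctness error) but stays lexically close to the reference.
triples = [(
    "The company earned 5 million dollars in 2020",
    "The firm made 5 million dollars in 2020",
    "The company earned 7 million dollars in 2020",
)]

combined = combine(toy_number_check, toy_overlap, weight=0.5)
overlap_accuracy = preference_accuracy(toy_overlap, triples)
combined_accuracy = preference_accuracy(combined, triples)
```

On this single hand-picked triple, the overlap metric prefers the adversarial candidate because it shares more tokens with the reference, while the combined metric prefers the faithful paraphrase. This is a toy illustration of the failure mode the abstract describes, not a reproduction of the paper's results.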
