CL · Sep 6, 2022

Layer or Representation Space: What makes BERT-based Evaluation Metrics Robust?

arXiv:2209.02317v2 · 585 citations · h-index: 30
Originality: Incremental advance
AI Analysis

The paper addresses the generalization problem of evaluation metrics for text generation, particularly in noisy or new domains. The advance is incremental because it builds on existing BERT-based methods.

The paper examines the robustness of BERTScore, an embedding-based metric for text generation, showing that it can fail in noisy domains with unknown tokens, and finds that using first-layer or character-level embeddings improves robustness.

The evaluation of recent embedding-based evaluation metrics for text generation is primarily based on measuring their correlation with human evaluations on standard benchmarks. However, these benchmarks are mostly from similar domains to those used for pretraining word embeddings. This raises concerns about the (lack of) generalization of embedding-based metrics to new and noisy domains that contain a different vocabulary than the pretraining data. In this paper, we examine the robustness of BERTScore, one of the most popular embedding-based metrics for text generation. We show that (a) an embedding-based metric that has the highest correlation with human evaluations on a standard benchmark can have the lowest correlation if the amount of input noise or unknown tokens increases, (b) taking embeddings from the first layer of pretrained models improves the robustness of all metrics, and (c) the highest robustness is achieved when using character-level embeddings, instead of token-based embeddings, from the first layer of the pretrained model.
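For readers unfamiliar with how BERTScore-style metrics turn embeddings into a score, the following is a minimal sketch of greedy cosine matching between candidate and reference token vectors. It is an illustration under stated assumptions, not the paper's implementation: the function names `cosine` and `greedy_match_scores` are hypothetical, the input vectors are toy values rather than embeddings from a pretrained model, and optional refinements such as IDF weighting are omitted. In the setting the abstract describes, the vectors would come from a chosen layer (for example the first) and representation space (token-level or character-level) of a pretrained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def greedy_match_scores(cand_vecs, ref_vecs):
    """Return (precision, recall, f1) from greedy cosine matching.

    cand_vecs / ref_vecs: non-empty lists of per-token vectors.
    Assumption: these stand in for embeddings taken from a pretrained model.
    """
    # Pairwise similarity matrix: rows are candidate tokens, columns are reference tokens.
    sim = [[cosine(c, r) for r in ref_vecs] for c in cand_vecs]

    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(row) for row in sim) / len(cand_vecs)

    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(
        max(sim[i][j] for i in range(len(cand_vecs)))
        for j in range(len(ref_vecs))
    ) / len(ref_vecs)

    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy usage: identical "sentences" match perfectly; orthogonal ones do not match at all.
same = [[1.0, 0.0], [0.0, 1.0]]
print(greedy_match_scores(same, same))
print(greedy_match_scores([[1.0, 0.0]], [[0.0, 1.0]]))
```

Because the score depends only on the vectors passed in, swapping the embedding source (a different layer, or character-level instead of token-level vectors) changes the metric's behavior without changing the matching step.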

Code Implementations: 1 repo