CL | AI | Apr 26, 2024

CEval: A Benchmark for Evaluating Counterfactual Text Generation

arXiv:2404.17475v2 | 26 citations | h-index: 13 | Has Code | INLG
Originality: Synthesis-oriented
AI Analysis

This work addresses a methodological gap for NLP researchers by providing a standardized evaluation framework, though the contribution is incremental because it builds on existing datasets and methods.

The authors tackle inconsistent evaluation in counterfactual text generation by proposing CEval, a benchmark that unifies metrics, datasets, and baselines. They find that no method perfectly balances counterfactual effectiveness and text quality.

Counterfactual text generation aims to minimally change a text such that it is classified differently. Judging advancements in method development for counterfactual text generation is hindered by non-uniform usage of datasets and metrics in related work. We propose CEval, a benchmark for comparing counterfactual text generation methods. CEval unifies counterfactual and text quality metrics and includes common counterfactual datasets with human annotations, standard baselines (MICE, GDBA, CREST), and the open-source language model LLAMA-2. Our experiments found no perfect method for generating counterfactual text. Methods that excel at counterfactual metrics often produce lower-quality text, while LLMs with simple prompts generate high-quality text but struggle with counterfactual criteria. By making CEval available as an open-source Python library, we encourage the community to contribute more methods and maintain consistent evaluation in future work.
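
To make the two criteria in the abstract concrete (the edit should change the classifier's prediction, and the change should be minimal), here is a minimal Python sketch of two commonly used measures: a label flip rate and a word-level edit distance. This is an illustration under stated assumptions, not CEval's API or its exact metric definitions. The function names `flip_rate` and `token_edit_distance`, the whitespace tokenization, and the keyword-based `toy_classifier` are all hypothetical.

```python
from typing import Callable, Sequence


def flip_rate(
    originals: Sequence[str],
    counterfactuals: Sequence[str],
    classify: Callable[[str], str],
) -> float:
    """Fraction of pairs whose predicted label changes after the edit."""
    if len(originals) != len(counterfactuals):
        raise ValueError("inputs must have the same length")
    if not originals:
        return 0.0
    flipped = sum(
        classify(o) != classify(c) for o, c in zip(originals, counterfactuals)
    )
    return flipped / len(originals)


def token_edit_distance(a: str, b: str) -> float:
    """Word-level Levenshtein distance, normalized by the length of `a`.

    Assumption: whitespace tokenization; a real benchmark may tokenize
    differently.
    """
    x, y = a.split(), b.split()
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, tok_x in enumerate(x, start=1):
        curr = [i]
        for j, tok_y in enumerate(y, start=1):
            cost = 0 if tok_x == tok_y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / max(len(x), 1)


def toy_classifier(text: str) -> str:
    """Hypothetical stand-in for a real sentiment classifier."""
    return "negative" if "boring" in text.lower() else "positive"


if __name__ == "__main__":
    originals = ["The movie was great", "A boring plot"]
    counterfactuals = ["The movie was boring", "A thrilling plot"]
    print(flip_rate(originals, counterfactuals, toy_classifier))
    print(token_edit_distance(originals[0], counterfactuals[0]))
```

A higher flip rate and a lower edit distance together indicate edits that are both effective and minimal. Text quality (for example fluency) would need a separate measure, which this sketch does not cover.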

Code Implementations: 1 repo