CLLGSep 27, 2022

EditEval: An Instruction-Based Benchmark for Text Improvements

Meta AI
arXiv:2209.13331v136 citationsh-index: 71
AI Analysis

This addresses the problem of evaluating iterative text improvements for researchers and developers in natural language processing, though it is incremental as it builds on existing datasets and methods.

The authors tackled the lack of comprehensive evaluation for text editing capabilities by introducing EditEval, an instruction-based benchmark, and found that InstructGPT and PEER performed best, though most baselines fell below supervised state-of-the-art, especially in tasks like neutralizing and updating information.

Evaluation of text generation to date has primarily focused on content created sequentially, rather than improvements on a piece of text. Writing, however, is naturally an iterative and incremental process that requires expertise in different modular skills such as fixing outdated information or making the style more consistent. Even so, comprehensive evaluation of a model's capacity to perform these skills and the ability to edit remains sparse. This work presents EditEval: An instruction-based, benchmark and evaluation suite that leverages high-quality existing and new datasets for automatic evaluation of editing capabilities such as making text more cohesive and paraphrasing. We evaluate several pre-trained models, which shows that InstructGPT and PEER perform the best, but that most baselines fall below the supervised SOTA, particularly when neutralizing and updating information. Our analysis also shows that commonly used metrics for editing tasks do not always correlate well, and that optimization for prompts with the highest performance does not necessarily entail the strongest robustness to different models. Through the release of this benchmark and a publicly available leaderboard challenge, we hope to unlock future research in developing models capable of iterative and more controllable editing.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes