EditPropBench: Measuring Factual Edit Propagation in Scientific Manuscripts
For researchers and practitioners using LLMs for scientific manuscript editing, this benchmark reveals that current models still fail to reliably propagate factual edits to dependent claims, highlighting the need for cascade-aware revision tools.
EditPropBench is a benchmark that evaluates whether LLM editors propagate factual edits through dependent claims in scientific manuscripts. On the hardest stratum, LLM editors achieve Edit-Ripple Adherence scores between 0.148 and 0.705, with the best system missing about 30% of required cascade updates.
Local factual edits in scientific manuscripts often create non-local revision obligations. If a dataset changes from 215 to 80 documents, claims such as 'medium-scale' or 'a few hundred items' may also become stale, even though they do not repeat the edited number. We introduce EditPropBench, a benchmark for measuring whether LLM editors propagate factual edits through dependent manuscript claims. Each item contains an ML/NLP-style synthetic manuscript, a targeted edit, and a controlled fact graph with sentence-level labels for direct targets, required downstream updates, and protected unrelated text. It offers a controlled, manuscript-level setting with sentence-level dependency supervision, three editing protocols, adversarial metric probes, stress-test variants, and a metric suite centered on Edit-Ripple Adherence (ERA). On the hard implicit/free-form stratum, five LLM editing systems achieve ERA scores from 0.148 to 0.705; even the strongest misses roughly 30% of required cascade updates. A mixed-stratum stress test shows that LLMs retain an advantage over deterministic substitution baselines when easy, substitution-solvable cases are included. Finally, an audit of recent arXiv cs.CL benchmark and dataset papers finds fact-dependent qualitative claims in 37.2% of papers. EditPropBench shows that current LLM editors can repair many implicit consequences of factual edits, but reliable scientific revision still requires cascade-aware checking.
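
To make the item structure concrete, here is a minimal Python sketch of a toy item based on the 215-to-80 example. It is an illustration under stated assumptions, not the benchmark's actual schema or scoring code: the names `Role`, `Sentence`, `substitute`, and `cascade_update_rate` are hypothetical, the manuscript sentences are invented, and the score is only a rough stand-in for ERA (assumed here to behave like the fraction of required downstream sentences that were revised, which may differ from the real definition).

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    # Hypothetical label names mirroring the three sentence-level categories.
    DIRECT = "direct"        # sentence that states the edited fact
    CASCADE = "cascade"      # dependent claim that requires a downstream update
    PROTECTED = "protected"  # unrelated text that should stay unchanged


@dataclass
class Sentence:
    text: str
    role: Role


def substitute(sentences, old, new):
    """Deterministic baseline: replace the literal old value wherever it appears."""
    return [s.text.replace(old, new) for s in sentences]


def cascade_update_rate(sentences, revised):
    """Fraction of CASCADE sentences whose text changed.

    Assumption: a crude proxy for ERA. It checks only that a sentence was
    modified, not that the modification is correct.
    """
    pairs = [(s.text, r) for s, r in zip(sentences, revised) if s.role is Role.CASCADE]
    if not pairs:
        return 1.0  # nothing required, so nothing was missed
    return sum(before != after for before, after in pairs) / len(pairs)


# Invented toy manuscript for the edit "215 documents" -> "80 documents".
manuscript = [
    Sentence("The corpus contains 215 documents.", Role.DIRECT),
    Sentence("This medium-scale corpus supports fine-grained analysis.", Role.CASCADE),
    Sentence("Each split holds a few hundred items.", Role.CASCADE),
    Sentence("All annotators were trained linguists.", Role.PROTECTED),
]

revised = substitute(manuscript, "215", "80")
print(revised[0])                                # The corpus contains 80 documents.
print(cascade_update_rate(manuscript, revised))  # 0.0
```

In this toy case the substitution baseline fixes the direct target but leaves both dependent claims stale, because neither repeats the edited number. This is the kind of implicit dependency the hard stratum is described as targeting.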