CL, HC · Jun 4

FOXGLOVE: Understanding Goal-Oriented and Anchored Writing Feedback from Experts and LLMs on Argumentative Essays

Yijun Liu, Yifan Song, John Gallagher, Sarah Sterman, Tal August

arXiv:2606.06271

AI Analysis

For researchers and practitioners in writing education, the paper provides a systematic comparison of expert and LLM feedback, showing where the two align and where they diverge.

The paper introduces FOXGLOVE, a dataset comparing feedback from writing instructors and LLMs on argumentative essays. It finds that instructors and LLMs distribute feedback similarly across goals but diverge on which specific sentences they comment on. LLM feedback also receives higher quality ratings, though much of that advantage appears attributable to longer comments.

While large language models (LLMs) are increasingly used to generate writing feedback, there remains no systematic comparison of LLM and expert feedback on the dimensions that writing research identifies as central to revision: goal-orientation, anchoring to specific sentences, and prioritization. We introduce FOXGLOVE, a dataset of 696 feedback comments written by trained writing instructors on 69 twelfth-grade argumentative essays, paired with 1,644 comments generated from four frontier LLMs under a shared protocol, totaling 2,340 comments. We provide expert quality ratings on a subset of both instructor and LLM comments. We find that instructors and LLMs distribute feedback similarly across goals and essay positions, yet instructors and models diverge on the specific sentences on which to provide feedback. Additionally, we find that models tend to write more complex feedback and use fewer questions than instructors. LLM feedback also receives higher ratings on most dimensions of quality, as rated by instructors, but much of this advantage appears to be attributable to lengthier comments. FOXGLOVE enables systematic comparison of where human and LLM feedback align, diverge, and differ.
