HC · May 22

MindCopilot: Towards Formalizing and Evaluating Granular Human-LLM Co-Writing

arXiv:2605.23535
AI Analysis

For researchers and developers of writing assistants, this work provides a formal framework and evaluation metrics to assess proactive co-writing systems, addressing the gap in evaluating real-time user interaction.

The paper formalizes human-LLM co-writing as a Markov Decision Process and proposes interaction-aware metrics (Co-Writing Fidelity Suite) to evaluate proactive writing assistants. A simulation study across 16 domains and a user study with 30 participants show that these metrics capture user acceptance and editing effort better than output-only metrics.

Recent writing assistants are increasingly shifting from passive, prompt-driven interaction to proactive, suggestion-based completion, which integrates localized continuations into the writing flow and reduces coordination burden. However, existing evaluations simply focus on output quality, failing to capture how users accept, edit, or repair suggestions in real-time interaction, and thus obscuring the true usability of proactive co-writing systems. To address this gap, we adopt a sequential, behavior-centered view of interactive writing and formalize co-writing as a Human-in-the-Loop Markov Decision Process, modeling writing as an interaction shaped by user acceptance and editing decisions. Based on this formulation, we introduce the Co-Writing Fidelity Suite, an interaction-aware metric suite that captures both user-assistant alignment and cognitive editing effort, including Hierarchical Acceptance Rate and Knowledge-aware Editing Distance. We conduct a large-scale simulation study across 16 writing domains, using 1,688 controlled continuation queries sampled from different writing stages. Our analysis reveals systematic effects of interaction structure on acceptance behavior and editing cost. A follow-up user study with 30 participants confirms that these behavioral patterns align with real user experience. Together, our findings demonstrate that interaction-aware evaluation provides insights beyond output-only metrics and informs the design of more effective proactive writing assistants.
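The sequential view described in the abstract can be illustrated with a toy sketch. The abstract does not define Hierarchical Acceptance Rate or Knowledge-aware Editing Distance, so the code below does not implement them; it substitutes a plain acceptance rate and a character-level Levenshtein distance as stand-ins. The names `Step`, `classify`, and `summarize` are hypothetical and exist only for this illustration. Each step treats the text written so far as the state, the assistant's suggestion as the action, and the user's accept/edit/reject decision as what determines the next state.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    """One interaction step (hypothetical structure, not from the paper)."""
    suggestion: str  # continuation proposed by the assistant
    final_text: str  # text the user actually kept ("" means rejected)


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def classify(step: Step) -> str:
    """Label the user's response to a suggestion."""
    if step.final_text == "":
        return "reject"
    if step.final_text == step.suggestion:
        return "accept"
    return "edit"


def summarize(steps: List[Step]) -> Dict[str, object]:
    """Stand-in interaction metrics over one writing episode."""
    labels = [classify(s) for s in steps]
    # Editing cost is measured only on suggestions the user kept but changed.
    edit_costs = [levenshtein(s.suggestion, s.final_text)
                  for s, label in zip(steps, labels) if label == "edit"]
    return {
        "acceptance_rate": labels.count("accept") / len(steps) if steps else 0.0,
        "mean_edit_distance": sum(edit_costs) / len(edit_costs) if edit_costs else 0.0,
        # The final document is the concatenation of what the user kept.
        "document": "".join(s.final_text for s in steps),
    }


if __name__ == "__main__":
    episode = [
        Step("Hello ", "Hello "),   # accepted verbatim
        Step("wrold.", "world."),   # kept after repair
        Step(" Bye", ""),           # rejected
    ]
    print(summarize(episode))
```

An output-only metric would score just the final `document`; the per-step labels and edit costs are the extra signal that an interaction-aware evaluation looks at.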
