SE · AI · May 7

An Empirical Study of Proactive Coding Assistants in Real-World Software Development

arXiv:2605.05700 · 24.51 citations · h-index: 5
Predicted impact top 19% in SE · last 90 days · Originality: Incremental advance
AI Analysis

For researchers and developers of proactive coding assistants, this paper highlights the critical need for real developer behavior data, as simulated data can mislead evaluation and training.

This paper investigates the gap between simulated and real IDE interaction traces for proactive coding assistants, finding that simulated traces differ substantially in behavioral diversity, temporal structure, and exploratory patterns. It introduces ProCodeBench, a real-world benchmark, and shows that current approaches perform poorly on real traces, with simulation-based evaluation overestimating real-world performance.

Large language model (LLM)-based coding assistants have made substantial progress, yet most systems remain reactive, requiring developers to explicitly formulate their needs. Proactive coding assistants aim to infer latent developer intent from integrated development environment (IDE) interactions and repository context, thereby reducing interaction overhead and supporting more seamless assistance. However, research in this direction is limited by the scarcity of large-scale real-world developer behavior data. Existing studies therefore often rely on LLM-simulated IDE traces, whose fidelity to real development behavior remains unclear. In this paper, we investigate this simulation-to-reality gap through a large-scale empirical study. We collect real IDE interaction traces from 1,246 experienced industry developers over three consecutive days using a custom Visual Studio Code extension, and construct paired LLM-simulated traces for controlled comparison. Our analysis shows that simulated traces differ substantially from real traces in behavioral diversity, temporal structure, and exploratory patterns. Based on the collected data, we introduce ProCodeBench, a real-world benchmark for proactive intent prediction. Experiments with representative LLMs, retrieval-augmented methods, and agentic baselines show that current approaches remain far from reliable under real IDE traces, suggesting that simulation-based evaluation can overestimate real-world performance. Finally, our training study shows that simulated data cannot replace real data, but can complement it when used before real-world fine-tuning. These findings highlight the importance of real developer behavior data for evaluating and training proactive coding assistants.
