Code Comprehension with GitHub Copilot: Performance Gains, Comprehension Trade-offs, and Behavioral Predictors in Brownfield Programming
For CS educators and tool designers, this study shows that GenAI assistants do not inherently harm comprehension, but passive use does, highlighting the need for educational interventions that promote active engagement.
A within-subject study with 15 graduate CS students found that GitHub Copilot improved task performance but not overall code comprehension, revealing a comprehension-performance decoupling. Active verification of generated code strongly predicted comprehension, with high-comprehension participants verifying code 4.7 times more frequently than low-comprehension ones.
Teaching Computer Science (CS) students how to comprehend and maintain legacy code bases is a critical challenge in software engineering education. While Generative AI (GenAI) assistants like GitHub Copilot improve task completion speed and correctness, their impact on code understanding remains unclear. We conducted a within-subject study with 15 graduate CS students completing feature implementation tasks with and without Copilot. Despite significant performance improvements, participants showed no overall comprehension improvement ($p=0.59$), revealing a \textit{comprehension-performance decoupling}. Further analysis uncovered a \textit{comprehension trade-off}: performance gains negatively correlated with reverse engineering comprehension ($ρ=-0.57$, $p=0.026$) but showed a positive trend with implementation comprehension ($ρ=0.50$, $p=0.06$). A follow-up behavioral analysis revealed that \textit{how} students used Copilot determined outcomes: Engaging in verification loops in which programmers actively reviewed generated code strongly predicted comprehension ($p<0.001$, $r=0.96$), with high-comprehension participants verifying code 4.7 times more frequently than low-comprehension participants. These findings suggest that GenAI tools do not inherently undermine comprehension; rather, passive consumption patterns do. This suggests a need to alter programming education to teach system-level verification skills, and the need to redesign educational GenAI tools to scaffold active cognitive engagement.