SEMay 21

On the Reliability of Code Comprehension Proxies

arXiv:2605.2300815.7
Predicted impact top 88% in SE · last 90 daysOriginality Incremental advance
AI Analysis

For researchers in software engineering, this study provides empirical evidence on which comprehension proxies are reliable, helping to improve the validity of future studies.

This paper investigates the reliability of various code comprehension proxies used in prior work, finding that proxies based on input-output questions and response time are especially reliable, while those based on syntax questions are unreliable, questioning parts of existing literature.

Prior work on code comprehension uses different comprehension proxies-for example, Likert-scale ratings or answers to input-output questions about program snippets, usually collected from students, to approximate whether code is comprehensible to software engineers, but the relative reliability of these proxies is not known. This paper investigates the relative reliability of a collection of proxies common in the extant literature with a pair of human studies. First, we conducted an expert-consensus study with a panel of five professional software engineers to establish a ground-truth comprehensibility ranking of eight code snippets by adapting the Delphi expert-consensus protocol. The Delphi protocol is widely used for expert consensus under conditions of uncertainty in other domains, such as medicine and national-security forecasting, but to our knowledge, this is its first application in software engineering. Second, we conducted a study with 44 student participants who completed tasks, allowing us to measure 14 comprehension proxies derived from the literature on the same set of eight code snippets. Finally, we conducted a correlation analysis on the results, concluding that proxies 1) derived from input-output questions and 2) that measure response time rather than accuracy are especially reliable. We also found that proxies derived from questions about program syntax (rather than semantics) are especially unreliable, regardless of measurement strategy, which draws into question the reliability of parts of the existing comprehensibility literature.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes