SE AI

Feb 19, 2024

Towards Reliable Evaluation of Neural Program Repair with Natural Robustness Testing

Thanh Le-Cong, Dat Nguyen, Bach Le, Toby Murray

arXiv:2402.11892v27.09 citations

h-index: 12

Has Code

ACM Trans Softw Eng Methodol

Originality: Incremental advance

AI Analysis

This work addresses the reliability of evaluation for researchers and practitioners in automated software repair, though it is an incremental advance that builds on existing robustness testing methods.

The paper tackles unreliable robustness evaluation in Neural Program Repair (NPR) by shifting the focus to naturally occurring data transformations. A human study finds that only 60% of the examined transformations are natural, and that unnatural transformations significantly affect both benchmark applicability and the conclusions drawn from robustness testing. Experiments under natural transformations show substantial prediction changes and reductions in plausible and correct patch rates.

In this paper, we propose shifting the focus of robustness evaluation for Neural Program Repair (NPR) techniques toward naturally-occurring data transformations. To accomplish this, we first examine the naturalness of semantic-preserving transformations through a two-stage human study. This study includes (1) interviews with senior software developers to establish concrete criteria for evaluating the naturalness of these transformations, and (2) a survey involving 10 developers to assess the naturalness of 1,178 transformations, i.e., pairs of original and transformed programs, applied to 225 real-world bugs. Our findings show that only 60% of these transformations are deemed natural, while 20% are considered unnatural, with strong agreement among annotators. Moreover, the unnaturalness of these transformations significantly impacts both their applicability to benchmarks and the conclusions drawn from robustness testing. Next, we conduct natural robustness testing on NPR techniques to assess their true effectiveness against real-world data variations. Our experimental results reveal a substantial number of prediction changes in NPR techniques, leading to significant reductions in both plausible and correct patch rates when comparing performance on the original and transformed datasets. Additionally, we observe notable differences in performance improvements between NPR techniques, suggesting potential biases in NPR evaluation introduced by limited datasets. Finally, we propose an LLM-based metric to automate the assessment of transformation naturalness, ensuring the scalability of natural robustness testing.
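
The comparison the abstract describes, the same bugs repaired before and after a semantic-preserving transformation, can be illustrated with a small sketch. This is not the authors' tooling: `RepairOutcome`, `patch_rates`, and `robustness_report` are hypothetical names, and counting a change in the plausible/correct outcome as a "prediction change" is a simplifying assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepairOutcome:
    """Hypothetical record of one repair attempt on one bug."""
    plausible: bool  # generated patch passes the available tests
    correct: bool    # generated patch is judged equivalent to the developer fix


def patch_rates(outcomes):
    """Return (plausible rate, correct rate) over a dict of bug id -> outcome."""
    n = len(outcomes)
    plausible = sum(o.plausible for o in outcomes.values()) / n
    correct = sum(o.correct for o in outcomes.values()) / n
    return plausible, correct


def robustness_report(original, transformed):
    """Compare outcomes on original bugs with outcomes on their transformed versions.

    Only bugs present in both dicts are compared, since a transformation
    may not be applicable to every bug in a benchmark.
    """
    shared = sorted(set(original) & set(transformed))
    if not shared:
        raise ValueError("no bugs in common between the two runs")
    orig = {b: original[b] for b in shared}
    trans = {b: transformed[b] for b in shared}
    # Assumption: an outcome change stands in for a prediction change.
    changed = sum(orig[b] != trans[b] for b in shared)
    orig_plausible, orig_correct = patch_rates(orig)
    trans_plausible, trans_correct = patch_rates(trans)
    return {
        "n_bugs": len(shared),
        "changed_outcomes": changed,
        "plausible_rate_drop": orig_plausible - trans_plausible,
        "correct_rate_drop": orig_correct - trans_correct,
    }


# Made-up toy data, not results from the paper.
original = {
    "bug-1": RepairOutcome(plausible=True, correct=True),
    "bug-2": RepairOutcome(plausible=True, correct=False),
    "bug-3": RepairOutcome(plausible=False, correct=False),
    "bug-4": RepairOutcome(plausible=True, correct=True),
}
transformed = {
    "bug-1": RepairOutcome(plausible=True, correct=False),
    "bug-2": RepairOutcome(plausible=False, correct=False),
    "bug-3": RepairOutcome(plausible=False, correct=False),
    "bug-4": RepairOutcome(plausible=True, correct=True),
}
report = robustness_report(original, transformed)
print(report)
```

In a natural robustness test along the lines the abstract proposes, `transformed` would contain only outcomes for transformations judged natural, so that a drop in either rate reflects variation developers could plausibly write.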
