SE AI PL

May 25, 2022

Towards Using Data-Influence Methods to Detect Noisy Samples in Source Code Corpora

Anh T. V. Dau, Thang Nguyen-Duc, Hoang Thanh-Tung, Nghi D. Q. Bui

arXiv:2205.13022v28.66 citations

h-index: 17

Originality: Synthesis-oriented

AI Analysis

This work addresses the issue of noisy data for developers and researchers using neural source code models, but it is incremental as it applies existing data-influence methods to a new domain.

The paper tackles the problem of noise in source code corpora used to train neural models for software engineering tasks by adapting data-influence methods to detect noisy samples, with evaluation results showing these methods can identify such noise in classification-based tasks.

Despite the recent trend of developing and applying neural source code models to software engineering tasks, the quality of such models is insufficient for real-world use. One reason is that the source code corpora used to train such models may contain noise. In this paper, we adapt data-influence methods to detect such noise. Data-influence methods are used in machine learning to evaluate the similarity of a target sample to the correct samples in order to determine whether or not the target sample is noisy. Our evaluation results show that data-influence methods can identify noisy samples from neural code models in classification-based tasks. This approach will contribute to the larger vision of developing better neural source code models from a data-centric perspective, which is a key driver for developing useful source code models in practice.
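The abstract does not say which data-influence method the authors use, so the following is only a minimal sketch of the general idea, not the paper's implementation. It assumes a fixed logistic-regression model and scores each target sample by the mean dot product between its loss gradient and the loss gradients of a small set of trusted clean samples; a negative score suggests the sample's label disagrees with the clean data. The function names (`loss_gradients`, `influence_scores`), the weights `w`, and the toy data are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_gradients(w, X, y):
    # Per-sample gradient of the logistic loss w.r.t. w: (sigmoid(x.w) - y) * x
    return (sigmoid(X @ w) - y)[:, None] * X

def influence_scores(w, X_target, y_target, X_clean, y_clean):
    # Mean gradient dot product between each target sample and the clean samples.
    # Positive: the target pulls the model the same way as the clean data.
    # Negative: the target pulls against the clean data (possible label noise).
    g_target = loss_gradients(w, X_target, y_target)
    g_clean = loss_gradients(w, X_clean, y_clean)
    return (g_target @ g_clean.T).mean(axis=1)

# Hypothetical fixed model and toy 2-D data (assumptions for illustration only).
w = np.array([1.0, -1.0])
X_clean = np.array([[2.0, 0.0], [1.5, 0.5], [0.0, 2.0], [0.5, 1.5]])
y_clean = np.array([1.0, 1.0, 0.0, 0.0])

# The third target looks like class 1 but carries a flipped label of 0.
X_target = np.array([[1.8, 0.2], [0.2, 1.8], [1.7, 0.1]])
y_target = np.array([1.0, 0.0, 0.0])

scores = influence_scores(w, X_target, y_target, X_clean, y_clean)
suspected_noisy = np.where(scores < 0)[0]
print(scores, suspected_noisy)
```

In this toy setup only the flipped-label sample receives a negative score. A real neural code model would need gradients taken with respect to its own parameters (or a subset of them), and the threshold of zero is an arbitrary choice made for this sketch.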
