Towards Using Data-Influence Methods to Detect Noisy Samples in Source Code Corpora
This work addresses the issue of noisy data for developers and researchers using neural source code models, but it is incremental as it applies existing data-influence methods to a new domain.
The paper tackles the problem of noise in the source code corpora used to train neural models for software engineering tasks. It adapts data-influence methods to detect noisy samples, and its evaluation results show that these methods can identify such noise in classification-based tasks.
Despite the recent trend of developing and applying neural source code models to software engineering tasks, the quality of such models is insufficient for real-world use. This may be due to noise in the source code corpora used to train such models. In this paper, we adapt data-influence methods to detect such noise. Data-influence methods are used in machine learning to evaluate the similarity of a target sample to the correct samples in order to determine whether or not the target sample is noisy. Our evaluation results show that data-influence methods can identify noisy samples in the corpora used to train neural code models for classification-based tasks. This approach contributes to the larger vision of developing better neural source code models from a data-centric perspective, which is a key driver for developing useful source code models in practice.
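The idea of scoring a target sample by its similarity to correct samples can be illustrated with a minimal sketch. It is not the method evaluated in the paper: it assumes each code sample has already been mapped to a feature vector (for example, by a trained model's encoder), that a small set of trusted, correctly labeled samples is available, and that cosine similarity is an acceptable stand-in for a real data-influence score. The names `influence_score`, `flag_noisy`, and `trusted`, and the toy vectors and labels, are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean(values):
    """Arithmetic mean, defined as 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def influence_score(vec, label, trusted):
    """Assumed proxy score: mean similarity to trusted samples that share the
    target's label, minus mean similarity to trusted samples with other labels.
    A low score suggests the label disagrees with the sample's neighbours."""
    same = [cosine(vec, v) for v, y in trusted if y == label]
    other = [cosine(vec, v) for v, y in trusted if y != label]
    return mean(same) - mean(other)


def flag_noisy(samples, trusted, threshold=0.0):
    """Return indices of samples whose score falls below the threshold."""
    return [
        i
        for i, (vec, label) in enumerate(samples)
        if influence_score(vec, label, trusted) < threshold
    ]


# Toy feature vectors; in practice these would come from a code model.
trusted = [
    ([1.0, 0.1], "buggy"),
    ([0.9, 0.2], "buggy"),
    ([0.1, 1.0], "clean"),
    ([0.2, 0.9], "clean"),
]
samples = [
    ([0.95, 0.15], "buggy"),  # consistent with its label
    ([0.15, 0.95], "buggy"),  # resembles "clean" samples: likely mislabeled
    ([0.10, 0.80], "clean"),  # consistent with its label
]

noisy = flag_noisy(samples, trusted)
print(noisy)  # [1]
```

The threshold of 0.0 is an arbitrary choice for this toy data. In a realistic setting it would need to be tuned, and the similarity function would be replaced by whichever data-influence method is being adapted.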