Thang Nguyen-Duc

8.6 SE May 25, 2022

Towards Using Data-Influence Methods to Detect Noisy Samples in Source Code Corpora

Anh T. V. Dau, Thang Nguyen-Duc, Hoang Thanh-Tung et al.

Despite the recent trend of developing and applying neural source code models to software engineering tasks, the quality of such models is insufficient for real-world use. One likely cause is noise in the source code corpora used to train them. In this paper, we adapt data-influence methods to detect such noise. Data-influence methods are used in machine learning to evaluate the similarity of a target sample to correct samples in order to determine whether the target sample is noisy. Our evaluation results show that data-influence methods can identify noisy training samples for neural code models in classification-based tasks. This approach contributes to the larger vision of developing better neural source code models from a data-centric perspective, which is a key driver for developing useful source code models in practice.
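To make the idea of scoring a target sample against known-correct samples concrete, here is a minimal sketch. It is not the paper's implementation: it assumes a plain logistic-regression model and uses a gradient dot product as the influence proxy. The names `logistic_grads` and `influence_scores` and the toy data are hypothetical.

```python
import numpy as np

def logistic_grads(w, X, y):
    """Per-sample gradient of the logistic loss w.r.t. w (one row per sample)."""
    p = 1.0 / (1.0 + np.exp(-X @ w))      # predicted probability of class 1
    return (p - y)[:, None] * X

def influence_scores(w, X_train, y_train, X_ref, y_ref):
    """Mean gradient dot product of each training sample with a trusted reference set.

    Assumption: a low (negative) score means a gradient step on that training
    sample would increase the loss on the trusted samples, which hints at noise.
    """
    g_train = logistic_grads(w, X_train, y_train)
    g_ref = logistic_grads(w, X_ref, y_ref)
    return (g_train @ g_ref.T).mean(axis=1)

# Toy data (hypothetical): the second training sample carries a flipped label.
w = np.array([2.0, 0.0])
X_ref = np.array([[1.0, 0.0], [-1.0, 0.0]])
y_ref = np.array([1.0, 0.0])
X_train = np.array([[1.0, 0.0], [1.0, 0.0]])
y_train = np.array([1.0, 0.0])

scores = influence_scores(w, X_train, y_train, X_ref, y_ref)
suspect = int(np.argmin(scores))          # index of the most suspicious sample
```

In this toy setup the mislabeled sample receives the lowest score, so ranking samples by ascending score surfaces it first. For a neural code model, the gradients would come from the network's parameters instead of `w`.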

26.2 CL May 2, 2023

Class based Influence Functions for Error Detection

Thang Nguyen-Duc, Hoang Thanh-Tung, Quan Hung Tran et al.

Influence functions (IFs) are a powerful tool for detecting anomalous examples in large-scale datasets. However, they are unstable when applied to deep networks. In this paper, we provide an explanation for the instability of IFs and develop a solution to this problem. We show that IFs are unreliable when the two data points being compared belong to two different classes. Our solution leverages class information to improve the stability of IFs. Extensive experiments show that our modification significantly improves the performance and stability of IFs while incurring no additional computational cost.
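One way to picture how class information could enter the score is sketched below. The abstract does not spell out the aggregation, so the choices here are assumptions made for illustration: reference gradients are grouped by class, each training sample is scored against each class separately, and the minimum over classes is kept. The function name `class_based_scores` and the toy gradients are hypothetical, and every class is assumed to have at least one reference sample.

```python
import numpy as np

def class_based_scores(g_train, g_ref, y_ref, num_classes):
    """Score training gradients against reference gradients one class at a time.

    g_train: (n_train, d) per-sample gradients of the training data
    g_ref:   (n_ref, d) per-sample gradients of trusted reference data
    y_ref:   (n_ref,) integer class labels of the reference data
    Returns the minimum per-class mean dot product for each training sample
    (the min aggregation is an assumption, not taken from the abstract).
    """
    per_class = []
    for k in range(num_classes):
        g_k = g_ref[y_ref == k]                       # reference gradients of class k
        per_class.append((g_train @ g_k.T).mean(axis=1))
    return np.min(np.stack(per_class, axis=1), axis=1)

# Toy gradients (hypothetical values).
g_train = np.array([[1.0, 0.0], [-1.0, 0.0]])
g_ref = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]])
y_ref = np.array([0, 0, 1])

scores = class_based_scores(g_train, g_ref, y_ref, num_classes=2)
```

Computing the scores per class keeps comparisons across different classes from being averaged together with same-class comparisons, and it reuses the same gradient dot products an unmodified score would need.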

2 Papers