GenCode: A Generic Data Augmentation Framework for Boosting Deep Learning-Based Code Understanding
GenCode addresses a gap in automated training-data preparation for code intelligence, though the contribution appears incremental because it builds on existing code transformation techniques.
The paper tackles data augmentation for code data in deep learning-based code understanding. It introduces GenCode, a generic framework that follows a generation-and-selection paradigm to prepare training data. Compared to the state-of-the-art method, MixCode, the resulting models achieve 2.92% higher accuracy and 4.90% better robustness on average.
Pre-trained code models lead the era of code intelligence, and multiple models with impressive performance have been designed. However, one important problem, data augmentation for code data, which automatically helps developers prepare training data, has received little study in this field. In this paper, we introduce a generic data augmentation framework, GenCode, to enhance the training of code understanding models. Simply put, GenCode follows a generation-and-selection paradigm to prepare useful training code data. Specifically, it first employs code transformation techniques to generate new code candidates and then selects the important ones as training data according to importance metrics. To evaluate the effectiveness of GenCode, we conduct experiments on four code understanding tasks (e.g., code clone detection) and three pre-trained code models (e.g., CodeT5). Compared to the state-of-the-art (SOTA) code augmentation method, MixCode, GenCode produces code models with 2.92% higher accuracy and 4.90% better robustness on average.
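The generation-and-selection paradigm described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the two toy transformations, the function names (`rename_variable`, `insert_dead_code`, `generate_candidates`, `select_top_k`, `augment`), the per-sample top-k selection, and the choice to treat a higher score as more important are all assumptions made for illustration. The abstract does not specify which importance metric is used, so the `score` callable is a placeholder for one.

```python
import re
from typing import Callable, List

Transform = Callable[[str], str]


def rename_variable(code: str, old: str = "x", new: str = "tmp_x") -> str:
    """Toy transformation (assumed): rename a whole-word identifier."""
    return re.sub(rf"\b{re.escape(old)}\b", new, code)


def insert_dead_code(code: str) -> str:
    """Toy transformation (assumed): prepend a statement with no effect."""
    return "_unused = 0\n" + code


def generate_candidates(code: str, transforms: List[Transform]) -> List[str]:
    """Generation step: apply every transformation to one code sample."""
    return [transform(code) for transform in transforms]


def select_top_k(
    candidates: List[str], score: Callable[[str], float], k: int
) -> List[str]:
    """Selection step: keep the k candidates with the highest score.

    Treating 'highest' as 'most important' is an assumption; the right
    direction depends on the importance metric that is plugged in.
    """
    return sorted(candidates, key=score, reverse=True)[:k]


def augment(
    dataset: List[str],
    transforms: List[Transform],
    score: Callable[[str], float],
    k: int = 1,
) -> List[str]:
    """Run generation, then selection, for each sample in the dataset."""
    augmented: List[str] = []
    for code in dataset:
        candidates = generate_candidates(code, transforms)
        augmented.extend(select_top_k(candidates, score, k))
    return augmented


if __name__ == "__main__":
    samples = ["x = 1\nprint(x)"]
    # len() stands in for a real importance metric, purely for demonstration.
    for new_code in augment(samples, [rename_variable, insert_dead_code], len):
        print(new_code)
```

In a real training loop, `score` would presumably be computed from the model under training, and the selected candidates would be used as training data for the next epoch. Both points are assumptions about how such a pipeline would be wired up, not details given in the abstract.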