SE LG Jan 27

High-quality data augmentation for code comment classification

Thomas Borsani, Andrea Rosani, Giuseppe Di Fatta

arXiv:2601.19383v1 2.9 h-index: 19

Originality: Synthesis-oriented

AI Analysis

This work addresses dataset limitations faced by researchers and practitioners who apply NLP to code understanding in software engineering, though the contribution appears incremental.

The paper tackles the problem of limited and imbalanced datasets for code comment classification by introducing new synthetic oversampling and augmentation techniques, achieving a 2.56% improvement over the base classifier.

Code comments serve a crucial role in software development for documenting functionality, clarifying design choices, and assisting with issue tracking. They capture developers' insights about the surrounding source code, serving as an essential resource for both human comprehension and automated analysis. Nevertheless, since comments are in natural language, they present challenges for machine-based code understanding. To address this, recent studies have applied natural language processing (NLP) and deep learning techniques to classify comments according to developers' intentions. However, existing datasets for this task suffer from size limitations and class imbalance, as they rely on manual annotations and may not accurately represent the distribution of comments in real-world codebases. To overcome this issue, we introduce new synthetic oversampling and augmentation techniques based on high-quality data generation to enhance the NLBSE'26 challenge datasets. Our Synthetic Quality Oversampling Technique and Augmentation Technique (Q-SYNTH) yield promising results, improving the base classifier by $2.56\%$.
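
To make the class-imbalance problem concrete, the following is a minimal sketch of plain random oversampling, the simplest baseline that synthetic oversampling methods aim to improve on. It is not the authors' Q-SYNTH method, which the abstract does not specify in detail. The function name `random_oversample`, the toy comments, and the label names are all assumptions made for illustration.

```python
import random
from collections import Counter


def random_oversample(samples, seed=0):
    """Duplicate minority-class samples until every class matches the largest one.

    `samples` is a list of (comment_text, label) pairs. This is a naive
    baseline (hypothetical helper), not the Q-SYNTH technique from the paper.
    """
    rng = random.Random(seed)

    # Group comment texts by their label.
    by_label = {}
    for text, label in samples:
        by_label.setdefault(label, []).append(text)

    # Every class is grown to the size of the largest class.
    target = max(len(texts) for texts in by_label.values())

    balanced = []
    for label, texts in by_label.items():
        # Keep all original samples.
        balanced.extend((text, label) for text in texts)
        # Add randomly chosen duplicates to fill the gap.
        extra = target - len(texts)
        balanced.extend((rng.choice(texts), label) for _ in range(extra))
    return balanced


# Toy data invented for illustration; labels are hypothetical categories.
toy = [
    ("Returns the parsed config.", "summary"),
    ("Computes the hash of the input.", "summary"),
    ("Opens a connection to the server.", "summary"),
    ("Call close() when done.", "usage"),
    ("Deprecated since 2.0.", "deprecation"),
]

balanced = random_oversample(toy)
counts = Counter(label for _, label in balanced)
print(counts)
```

Because this baseline only repeats existing comments, it adds no new wording to the minority classes, which is the limitation that generating new, quality-controlled synthetic samples is meant to address.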
