A study of the impact of generative AI-based data augmentation on software metadata classification
This work addresses the challenge of improving software metadata classification for developers and researchers, but its contribution is incremental, since it builds on existing methods and is tied to a specific dataset.
The study tackles the problem of predicting the usefulness of code-comment pairs, a software metadata classification task, by training a machine learning model on neural contextual representations. Incorporating LLM-generated data yielded a 4% increase in F1-score over the baseline.
This paper presents the system submitted by the team from IIT (ISM) Dhanbad to FIRE IRSE 2023 shared task 1 on the automatic usefulness prediction of code-comment pairs, and examines the impact of adding Large Language Model (LLM) generated data to the original base data. We developed a framework in which a machine learning model is trained on neural contextual representations of comments and their corresponding code to predict the usefulness of each code-comment pair, and we analyze its performance when LLM-generated data is combined with the base data. In the official assessment, our system achieves a 4% increase in F1-score over the baseline; the assessment also considers the quality of the generated data.
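To make the reported comparison concrete, the following is a minimal sketch of how an F1-score gain between a baseline run and a run trained with additional LLM-generated data could be computed. The label lists are hypothetical toy values, not the shared-task data, and the names `f1_score`, `baseline_pred`, and `augmented_pred` are illustrative assumptions. The abstract does not say whether the 4% is an absolute or a relative gain, or which F1 averaging the task used, so the sketch computes binary F1 for the "useful" class and reports both kinds of gain.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the `positive` class: 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    # Return 0.0 when there are no positive labels or predictions at all.
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy labels: 1 = useful comment, 0 = not useful.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
# Hypothetical predictions from a model trained on the base data only.
baseline_pred = [1, 0, 1, 0, 1, 0, 0, 1]
# Hypothetical predictions from a model trained on base + LLM-generated data.
augmented_pred = [1, 1, 1, 0, 1, 0, 0, 1]

f1_base = f1_score(y_true, baseline_pred)
f1_aug = f1_score(y_true, augmented_pred)

absolute_gain = f1_aug - f1_base          # difference in F1 points
relative_gain = absolute_gain / f1_base   # gain as a fraction of the baseline

print(f"baseline F1:   {f1_base:.3f}")
print(f"augmented F1:  {f1_aug:.3f}")
print(f"absolute gain: {absolute_gain:.3f}")
print(f"relative gain: {relative_gain:.1%}")
```

Reporting both figures avoids ambiguity: the same pair of runs can look quite different depending on whether the gain is stated in F1 points or as a percentage of the baseline score.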