SELGMay 30, 2025

Applying Large Language Models to Issue Classification: Revisiting with Extended Data and New Models

arXiv:2506.00128v13 citationsh-index: 8Has CodeSci Comput Program
Originality Synthesis-oriented
AI Analysis

This work addresses the labor-intensive and data-hungry nature of issue classification for software developers, though it is incremental as it applies existing LLMs to a known task with extended data.

The research tackled the problem of automating issue classification in software engineering by applying large language models (LLMs), finding that GPT-4o outperformed DeepSeek R1 with an F1 score 20% higher, achieving 80.7% compared to 59.33% on a specific dataset.

Effective prioritization of issue reports in software engineering helps to optimize resource allocation and information recovery. However, manual issue classification is laborious and lacks scalability. As an alternative, many open source software (OSS) projects employ automated processes for this task, yet this method often relies on large datasets for adequate training. Traditionally, machine learning techniques have been used for issue classification. More recently, large language models (LLMs) have emerged as powerful tools for addressing a range of software engineering challenges, including code and test generation, mapping new requirements to legacy software endpoints, and conducting code reviews. The following research investigates an automated approach to issue classification based on LLMs. By leveraging the capabilities of such models, we aim to develop a robust system for prioritizing issue reports, mitigating the necessity for extensive training data while also maintaining reliability in classification. In our research, we developed an LLM-based approach for accurately labeling issues by selecting two of the most prominent large language models. We then compared their performance across multiple datasets. Our findings show that GPT-4o achieved the best results in classifying issues from the NLBSE 2024 competition. Moreover, GPT-4o outperformed DeepSeek R1, achieving an F1 score 20% higher when both models were trained on the same dataset from the NLBSE 2023 competition, which was ten times larger than the NLBSE 2024 dataset. The fine-tuned GPT-4o model attained an average F1 score of 80.7%, while the fine-tuned DeepSeek R1 model achieved 59.33%. Increasing the dataset size did not improve the F1 score, reducing the dependence on massive datasets for building an efficient solution to issue classification.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes