Benchmarking LLM for Code Smells Detection: OpenAI GPT-4.0 vs DeepSeek-V3
This study provides guidance for practitioners on selecting efficient, cost-effective LLMs for automated code smell detection, but its contribution is incremental, as it benchmarks existing models on a new dataset.
This study benchmarks OpenAI GPT-4.0 and DeepSeek-V3 for detecting code smells across Java, Python, JavaScript, and C++, evaluating them with precision, recall, and F1 scores, and includes a cost analysis compared to tools like SonarQube.
Determining the most effective Large Language Model (LLM) for code smell detection is a complex challenge. This study introduces a structured methodology and evaluation matrix to address it, using a curated dataset of code samples consistently annotated with known smells. The dataset spans four prominent programming languages (Java, Python, JavaScript, and C++), allowing for cross-language comparison. We benchmark two state-of-the-art LLMs, OpenAI GPT-4.0 and DeepSeek-V3, using precision, recall, and F1 score as evaluation metrics. Our analysis covers three levels of detail: overall performance, category-level performance, and performance on individual code smell types. Additionally, we explore cost-effectiveness by comparing the token-based detection approach of GPT-4.0 with the pattern-matching techniques employed by DeepSeek-V3. The study also includes a cost analysis relative to traditional static analysis tools such as SonarQube. The findings offer practitioners guidance in selecting an efficient, cost-effective solution for automated code smell detection.
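As an illustration of the evaluation metrics named above, the following minimal Python sketch computes precision, recall, and F1 by comparing a model's reported smells against the annotated ones. It is not the study's evaluation code: the function name `precision_recall_f1`, the representation of a detection as a `(sample_id, smell_type)` pair, and the sample data are all assumptions made for the example.

```python
from typing import Iterable, Optional, Set, Tuple

# Hypothetical representation: one detection = (sample id, smell type).
Label = Tuple[str, str]


def precision_recall_f1(
    predicted: Iterable[Label],
    annotated: Iterable[Label],
    smell_type: Optional[str] = None,
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for predicted vs. annotated smells.

    If smell_type is given, only detections of that type are scored,
    which gives a per-smell-type view; otherwise all types are pooled.
    """
    pred: Set[Label] = set(predicted)
    gold: Set[Label] = set(annotated)
    if smell_type is not None:
        pred = {p for p in pred if p[1] == smell_type}
        gold = {g for g in gold if g[1] == smell_type}

    tp = len(pred & gold)  # smells reported and annotated
    fp = len(pred - gold)  # smells reported but not annotated
    fn = len(gold - pred)  # annotated smells the model missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


# Made-up example data, for illustration only.
annotated = [
    ("s1", "Long Method"),
    ("s1", "God Class"),
    ("s2", "Duplicate Code"),
    ("s3", "Long Method"),
]
predicted = [
    ("s1", "Long Method"),
    ("s2", "Duplicate Code"),
    ("s2", "Long Method"),
]

overall = precision_recall_f1(predicted, annotated)
long_method = precision_recall_f1(predicted, annotated, smell_type="Long Method")
```

With the made-up data, two of the three reported smells are correct and two annotated smells are missed, so the pooled precision is 2/3 and the pooled recall is 1/2. Passing `smell_type` restricts the same computation to a single smell type; a category-level score could be obtained the same way given a mapping from smell types to categories, which this sketch does not define.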