SE · CL · May 7

SmellBench: Evaluating LLM Agents on Architectural Code Smell Repair

Ion George Dinu, Marian Cristian Mihăescu, Traian Rebedea

arXiv:2605.07001 8.01 citations

Predicted impact: top 60% in SE · last 90 days

Originality: Incremental advance

AI Analysis

For software engineering researchers and practitioners, this work provides the first benchmark and reveals that current LLM agents lack the architectural understanding needed for cross-module refactoring.

The paper evaluates LLM agents on repairing architectural code smells, finding that the best agent achieves a 47.7% resolution rate but aggressive repair introduces up to 140 new smells, highlighting a gap in cross-module refactoring capabilities.

Architectural code smells erode software maintainability and are costly to repair manually, yet unlike localized bugs, they require cross-module reasoning about design intent that challenges both developers and automated tools. While large language model agents excel at bug fixing and code-level refactoring, their ability to repair architectural code smells remains unexplored. We present the first empirical evaluation of LLM agents on architectural code smell repair. We contribute SmellBench, a task orchestration framework that incorporates smell-type-specific optimized prompts and supports iterative multi-step execution, together with a scoring methodology that separately evaluates repair effectiveness, false positive identification, and net codebase impact. We evaluate 11 agent configurations from four model families (GPT, Claude, Gemini, Mistral) on 65 hard-severity architectural smells detected by PyExamine in the Python project scikit-learn, validated against expert judgments. Expert validation reveals that 63.1% of detected smells are false positives, while the best agent achieves a 47.7% resolution rate. Agents identify false positives with up to $\kappa = 0.94$ expert agreement, but repair aggressiveness and net codebase quality are inversely related: the most aggressive agent introduces 140 new smells. These findings expose a gap between current LLM capabilities in localized code transformations and the architectural understanding needed for cross-module refactoring. SmellBench provides reusable infrastructure for tracking progress on this underexplored dimension of automated software engineering. We release our code and data at https://doi.org/10.5281/zenodo.19247588.
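
As a rough illustration of the agreement figure quoted in the abstract, the sketch below computes chance-corrected agreement between an agent's and an expert's false-positive labels. It assumes that $\kappa$ denotes Cohen's kappa, which the abstract does not state explicitly; the function name `cohens_kappa` and the toy labels are hypothetical and are not taken from SmellBench or its data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences.

    Assumption: this is the standard two-rater definition; the paper's
    exact agreement computation is not given in the abstract.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal length")

    n = len(labels_a)

    # Observed agreement: fraction of items labelled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement by chance, from each rater's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)

    # Degenerate case: both raters used one identical label throughout.
    if expected == 1.0:
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Hypothetical toy labels: True means "this detected smell is a false positive".
expert = [True, True, False, False]
agent = [True, True, False, True]
kappa = cohens_kappa(expert, agent)
```

In this toy case the raters agree on three of four items, but half of that agreement is expected by chance, so kappa is noticeably lower than the raw agreement rate.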
