SE · Apr 3

A Multi-Language Perspective on the Robustness of LLM Code Generation

arXiv:2504.19108 · 11.29 citations · h-index: 6

Predicted impact: top 61% in SE · last 90 days · Originality: Synthesis-oriented

AI Analysis

This work addresses the robustness evaluation of LLMs in code generation for multiple languages, but it is incremental as it extends existing testing methods to new domains.

The study assessed the robustness of code generation models across multiple programming languages by introducing perturbations in prompts, finding that all models degraded under perturbations with varying severity, and that LLM-based docstring repair provided only marginal improvements.

Large language models have gained significant traction in recent times, and their use now extends to code-generation tasks. While this field has attracted considerable attention, testing and evaluating the robustness of code generation models remains an ongoing endeavor. Previous studies have focused primarily on code generation models for Python, overlooking other widely used programming languages. In this work, we conduct a comparative analysis of the robustness of several prominent code generation models and investigate whether robustness can be improved by repairing perturbed docstrings using an LLM. We also investigate how model performance varies across programming languages. To accomplish this, we introduce perturbations in four key areas of the prompt: docstring, function name, syntax, and format. We have compiled and released a dedicated dataset for this purpose. Our results show that all models consistently degrade under perturbations across all three languages, though the magnitude of the degradation depends on the language and the perturbation type. Larger model size does not reliably improve robustness, and semantic perturbations prove at least as disruptive as syntactic ones. Our LLM-based docstring repair yields only marginal gains for simple perturbations and can degrade performance for semantic ones, highlighting the limits of prompt-level mitigation.
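To make the idea of prompt perturbation concrete, here is a minimal Python sketch of three of the four areas named in the abstract (function name, docstring, and format). The operators, the sample prompt, and all function names below are illustrative assumptions, not the perturbations or dataset used in the paper; syntax perturbations are left out because they would typically need a language-aware parser.

```python
import re

# Hypothetical prompt in the style of a function-completion benchmark.
PROMPT = '''def has_close_elements(numbers, threshold):
    """Check if any two numbers are closer than threshold."""
'''


def rename_function(prompt: str) -> str:
    """Function-name perturbation (assumed operator): snake_case to camelCase."""
    def to_camel(match: re.Match) -> str:
        parts = match.group(1).split("_")
        return "def " + parts[0] + "".join(p.capitalize() for p in parts[1:])
    # Only the first definition in the prompt is renamed.
    return re.sub(r"def (\w+)", to_camel, prompt, count=1)


def perturb_docstring(prompt: str) -> str:
    """Docstring perturbation (assumed operator): uppercase the docstring text."""
    return re.sub(
        r'"""(.*?)"""',
        lambda m: '"""' + m.group(1).upper() + '"""',
        prompt,
        flags=re.S,
    )


def perturb_format(prompt: str) -> str:
    """Format perturbation (assumed operator): replace 4-space indents with tabs."""
    return prompt.replace("    ", "\t")


if __name__ == "__main__":
    for perturb in (rename_function, perturb_docstring, perturb_format):
        print(f"--- {perturb.__name__} ---")
        print(perturb(PROMPT))
```

A robustness evaluation in this style would feed both the original and the perturbed prompt to a model and compare the pass rates of the generated code; that comparison step is not shown here.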
