65.3SEMay 2
Specification-Driven Code Translation Powered by Large Language Models: How Far Are We?Soumit Kanti Saha, Fazle Rabbi, Song Wang et al.
Large Language Models (LLMs) are increasingly being applied across various domains, including code-related tasks such as code translation. Previous studies have explored using LLMs for translating code between different programming languages. Since LLMs are more effective with natural language, using natural language as an intermediate representation in code translation tasks is an intuitively appealing approach. However, whether this benefit is general or highly context-dependent remains unclear. In this work, we investigate using NL-specification as an intermediate representation for code translation. We evaluate our method using three datasets, five popular programming languages, and 29 language pair permutations. Our results show that using NL-specification alone does not lead to performance improvements. However, when combined with source code, it provides gains in certain language pairs (notably with Python and C++ as source languages), while offering no consistent improvement overall. Besides analyzing the performance of code translation, we also investigate the quality of the translated code and provide insights into the issues present in the translated code.
88.3SEMay 1
Social Bias in LLM-Generated Code: Benchmark and MitigationFazle Rabbi, Lin Ling, Song Wang et al.
Large Language Models (LLMs) are increasingly deployed to generate code for human-centered applications where demographic fairness is critical. However, existing evaluations focus almost exclusively on functional correctness, leaving social bias in LLM-generated code largely unexamined. Extending our prior work on Solar, we conduct a comprehensive empirical study using SocialBias-Bench, a benchmark of 343 real-world coding tasks spanning seven demographic dimensions. We evaluate four prominent LLMs and find severe bias across all models, with Code Bias Scores reaching up to 60.58%. We further show that standard prompt-level interventions, such as Chain-of-Thought reasoning and fairness persona assignment, inadvertently amplify bias rather than reduce it. We then investigate whether structured multi-agent software process frameworks can improve fairness, finding that structured pipelines reduce bias when early roles correctly scope what the code should and should not consider. However, adding explicit fairness instructions to all agent roles produces worse outcomes than providing none, suggesting that diffused responsibility goes unaddressed. To address these limitations, we propose the Fairness Monitor Agent (FMA), a modular component that plugs into any existing code generation pipeline without modifying it. FMA analyzes the task description to determine which attributes should be considered or restricted, then detects and corrects violations through an iterative review process, without requiring an executable test suite. Evaluated on all 343 tasks, FMA reduces bias by 65.1% compared to a developer agent alone and improves functional correctness from 75.80% to 83.97%, outperforming all other studied approaches.
38.4SEApr 3
A Multi-Language Perspective on the Robustness of LLM Code GenerationFazle Rabbi, Zishuo Ding, Jinqiu Yang
Large language models have gained significant traction and popularity in recent times, extending their usage to code-generation tasks. While this field has garnered considerable attention, the exploration of testing and evaluating the robustness of code generation models remains an ongoing endeavor. Previous studies have primarily focused on code generation models specifically for the Python language, overlooking other widely used programming languages. In this work, we conduct a comprehensive comparative analysis to assess the robustness performance of several prominent code generation models and investigate whether robustness can be improved by repairing perturbed docstrings using an LLM. Furthermore, we investigate how their performance varies across different programming languages. To accomplish this, we introduce perturbations in four key areas of the prompt: DocString, function name, syntax, and format. We have compiled and released a dedicated dataset for this purpose. Our results show that all models consistently degrade under perturbations across all three languages, but vary in magnitude depending on the language and perturbation type. Larger model size does not reliably improve robustness, and semantic perturbations prove at least as disruptive as syntactic ones. Our LLM-based docString repair yields only marginal gains for simple perturbations and can degrade performance for semantic ones, highlighting the limits of prompt-level mitigation.
SEDec 7, 2025
BabelCoder: Agentic Code Translation with Specification AlignmentFazle Rabbi, Soumit Kanti Saha, Tri Minh Triet Pham et al.
As software systems evolve, developers increasingly work across multiple programming languages and often face the need to migrate code from one language to another. While automatic code translation offers a promising solution, it has long remained a challenging task. Recent advancements in Large Language Models (LLMs) have shown potential for this task, yet existing approaches remain limited in accuracy and fail to effectively leverage contextual and structural cues within the code. Prior work has explored translation and repair mechanisms, but lacks a structured, agentic framework where multiple specialized agents collaboratively improve translation quality. In this work, we introduce BabelCoder, an agentic framework that performs code translation by decomposing the task into specialized agents for translation, testing, and refinement, each responsible for a specific aspect such as generating code, validating correctness, or repairing errors. We evaluate BabelCoder on four benchmark datasets and compare it against four state-of-the-art baselines. BabelCoder outperforms existing methods by 0.5%-13.5% in 94% of cases, achieving an average accuracy of 94.16%.
SEAug 12, 2021Code
SysMap: A Lightweight Software Visualization Tool to Analyze the Software Evolution of a SystemFazle Rabbi, Nishat Tasnim Niloy, Nadia Nahar et al.
Software visualization helps to comprehend the system by providing a vivid illustration. The developers, as well as the analysts, can have a glance over the total system to understand the basic changes over time from a high-level point of view through this technique. In recent years, many tools are proposed to visualize software based on different architectural metaphors, such as as- solar system, city or park. Some of the solutions have just worked on system visualization where a few tried to explain the changes in software throughout different versions that still need heavy manual work. Keeping such limitations in mind, this paper proposes a lightweight tool named SysMap that takes the source codes of different versions of software systems, provides 3D illustrations of those systems and a graphical statistic of its evolution. To build the graphical element to represent the system, the source code has been studied to find out different software metrics. For experimentation, several open-source java projects were chosen to find out the necessary information. Henceforth, this tool will surely increase the work efficiency of both the developer and analyst by reducing the manual effort and by providing the graphical view to comprehend the software evolution over time.
25.6SEMay 4
HEJ-Robust: A Robustness Benchmark for LLM-Based Automated Program RepairFazle Rabbi, Jinqiu Yang
Recent Large Language Models (LLMs) have shown strong performance on automated program repair across standard benchmarks. However, these benchmarks evaluate models on a single canonical form of buggy code and do not reflect the syntactic variations commonly observed in real-world software, leaving robustness largely unexamined. In this work, we construct HEJ-Robust, a robustness benchmark built from HumanEval-Java-Bug using eight semantics-preserving code transformations, resulting in 1,450 transformed instances. We evaluate five fine-tuned LLMs on this benchmark and show that model performance drops by over 50% under several transformations, indicating that current LLM-based repair models lack robustness to minor syntactic variations.
61.7SEMay 4
Beyond Translation Accuracy: Addressing False Failures in LLM-Based Code TranslationFazle Rabbi, Soumit Kanti Saha, Jinqiu Yang
Large Language Models (LLMs) have achieved remarkable success in automated code translation. While prior work has focused on improving translation accuracy through advanced prompting and iterative repair, the reliability of the underlying evaluation frameworks has received less attention. In this paper, we demonstrate that a significant number of reported failures in code translation are not due to incorrect logic, but rather evaluation-induced errors stemming from improper compilation flags, missing library links, and unconfigured runtime environments. We conduct a large-scale empirical study across five programming languages (C, C++, Java, Python, Go) and three benchmarks (Avatar, CodeNet, EvalPlus), covering 6,164 translations generated by GPT-4o, DeepSeek-Coder, and Magicoder. Our analysis identifies and categorizes common false negatives, distinguishing pipeline-induced failures that affect any model from model-dependent behaviors that vary across LLMs. Our findings highlight the necessity for transparent, configuration-aware evaluation standards to accurately assess progress in LLM-based code translation.