PLApr 5, 2022
On the Transferability of Pre-trained Language Models for Low-Resource Programming LanguagesFuxiang Chen, Fatemeh Fard, David Lo et al.
A recent study by Ahmed and Devanbu reported that using a corpus of code written in multilingual datasets to fine-tune multilingual Pre-trained Language Models (PLMs) achieves higher performance as opposed to using a corpus of code written in just one programming language. However, no analysis was made with respect to fine-tuning monolingual PLMs. Furthermore, some programming languages are inherently different and code written in one language usually cannot be interchanged with the others, i.e., Ruby and Java code possess very different structure. To better understand how monolingual and multilingual PLMs affect different programming languages, we investigate 1) the performance of PLMs on Ruby for two popular Software Engineering tasks: Code Summarization and Code Search, 2) the strategy (to select programming languages) that works well on fine-tuning multilingual PLMs for Ruby, and 3) the performance of the fine-tuned PLMs on Ruby given different code lengths. In this work, we analyze over a hundred of pre-trained and fine-tuned models. Our results show that 1) multilingual PLMs have a lower Performance-to-Time Ratio (the BLEU, METEOR, or MRR scores over the fine-tuning duration) as compared to monolingual PLMs, 2) our proposed strategy to select target programming languages to fine-tune multilingual PLMs is effective: it reduces the time to fine-tune yet achieves higher performance in Code Summarization and Code Search tasks, and 3) our proposed strategy consistently shows good performance on different code lengths.
SEApr 5, 2022
An Exploratory Study on Code Attention in BERTRishab Sharma, Fuxiang Chen, Fatemeh Fard et al.
Many recent models in software engineering introduced deep neural models based on the Transformer architecture or use transformer-based Pre-trained Language Models (PLM) trained on code. Although these models achieve the state of the arts results in many downstream tasks such as code summarization and bug detection, they are based on Transformer and PLM, which are mainly studied in the Natural Language Processing (NLP) field. The current studies rely on the reasoning and practices from NLP for these models in code, despite the differences between natural languages and programming languages. There is also limited literature on explaining how code is modeled. Here, we investigate the attention behavior of PLM on code and compare it with natural language. We pre-trained BERT, a Transformer based PLM, on code and explored what kind of information it learns, both semantic and syntactic. We run several experiments to analyze the attention values of code constructs on each other and what BERT learns in each layer. Our analyses show that BERT pays more attention to syntactic entities, specifically identifiers and separators, in contrast to the most attended token [CLS] in NLP. This observation motivated us to leverage identifiers to represent the code sequence instead of the [CLS] token when used for code clone detection. Our results show that employing embeddings from identifiers increases the performance of BERT by 605% and 4% F1-score in its lower layers and the upper layers, respectively. When identifiers' embeddings are used in CodeBERT, a code-based PLM, the performance is improved by 21-24% in the F1-score of clone detection. The findings can benefit the research community by using code-specific representations instead of applying the common embeddings used in NLP, and open new directions for developing smaller models with similar performance.
CLApr 5, 2022
LAMNER: Code Comment Generation Using Character Language Model and Named Entity RecognitionRishab Sharma, Fuxiang Chen, Fatemeh Fard
Code comment generation is the task of generating a high-level natural language description for a given code method or function. Although researchers have been studying multiple ways to generate code comments automatically, previous work mainly considers representing a code token in its entirety semantics form only (e.g., a language model is used to learn the semantics of a code token), and additional code properties such as the tree structure of a code are included as an auxiliary input to the model. There are two limitations: 1) Learning the code token in its entirety form may not be able to capture information succinctly in source code, and 2) The code token does not contain additional syntactic information, inherently important in programming languages. In this paper, we present LAnguage Model and Named Entity Recognition (LAMNER), a code comment generator capable of encoding code constructs effectively and capturing the structural property of a code token. A character-level language model is used to learn the semantic representation to encode a code token. For the structural property of a token, a Named Entity Recognition model is trained to learn the different types of code tokens. These representations are then fed into an encoder-decoder architecture to generate code comments. We evaluate the generated comments from LAMNER and other baselines on a popular Java dataset with four commonly used metrics. Our results show that LAMNER is effective and improves over the best baseline model in BLEU-1, BLEU-2, BLEU-3, BLEU-4, ROUGE-L, METEOR, and CIDEr by 14.34%, 18.98%, 21.55%, 23.00%, 10.52%, 1.44%, and 25.86%, respectively. Additionally, we fused LAMNER's code representation with the baseline models, and the fused models consistently showed improvement over the non-fused models. The human evaluation further shows that LAMNER produces high-quality code comments.
SEJan 28Code
Context-Augmented Code Generation Using Programming Knowledge GraphsShahd Seddik, Fahd Seddik, Iman Saberi et al.
Large Language Models (LLMs) excel at code generation but struggle with complex problems. Retrieval-Augmented Generation (RAG) mitigates this issue by integrating external knowledge, yet retrieval models often miss relevant context, and generation models hallucinate with irrelevant data. We propose Programming Knowledge Graph (PKG) for semantic representation and fine-grained retrieval of code and text. Our approach enhances retrieval precision through tree pruning and mitigates hallucinations via a re-ranking mechanism that integrates non-RAG solutions. Structuring external data into finer-grained nodes improves retrieval granularity. Evaluations on HumanEval and MBPP show up to 20% pass@1 accuracy gains and a 34% improvement over baselines on MBPP. Our findings demonstrate that our proposed PKG approach along with re-ranker effectively address complex problems while maintaining minimal negative impact on solutions that are already correct without RAG. The replication package is published at https://github.com/iamshahd/ProgrammingKnowledgeGraph
SEMar 3, 2023
APIContext2Com: Code Comment Generation by Incorporating Pre-Defined API DocumentationRamin Shahbazi, Fatemeh Fard
Code comments are significantly helpful in comprehending software programs and also aid developers to save a great deal of time in software maintenance. Code comment generation aims to automatically predict comments in natural language given a code snippet. Several works investigate the effect of integrating external knowledge on the quality of generated comments. In this study, we propose a solution, namely APIContext2Com, to improve the effectiveness of generated comments by incorporating the pre-defined Application Programming Interface (API) context. The API context includes the definition and description of the pre-defined APIs that are used within the code snippets. As the detailed API information expresses the functionality of a code snippet, it can be helpful in better generating the code summary. We introduce a seq-2-seq encoder-decoder neural network model with different sets of multiple encoders to effectively transform distinct inputs into target comments. A ranking mechanism is also developed to exclude non-informative APIs, so that we can filter out unrelated APIs. We evaluate our approach using the Java dataset from CodeSearchNet. The findings reveal that the proposed model improves the best baseline by 1.88 (8.24 %), 2.16 (17.58 %), 1.38 (18.3 %), 0.73 (14.17 %), 1.58 (14.98 %) and 1.9 (6.92 %) for BLEU1, BLEU2, BLEU3, BLEU4, METEOR, ROUGE-L respectively. Human evaluation and ablation studies confirm the quality of the generated comments and the effect of architecture and ranking APIs.
SEApr 7
A Survey of Algorithm Debt in Machine and Deep Learning Systems: Definition, Smells, and Future WorkEmmanuel Iko-Ojo Simon, Chirath Hettiarachchi, Fatemeh Fard et al.
The adoption of Machine and Deep Learning (ML/DL) technologies introduces maintenance challenges, leading to Technical Debt (TD). Algorithm Debt (AD) is a TD type that impacts the performance and scalability of ML/DL systems. A review of 42 primary studies expanded AD's definition, uncovered its implicit presence, identified its smells, and highlighted future directions. These findings will guide an AD-focused study, enhancing the reliability of ML/DL systems.
AINov 24, 2025Code
FISCAL: Financial Synthetic Claim-document Augmented Learning for Efficient Fact-CheckingRishab Sharma, Iman Saberi, Elham Alipour et al.
Financial applications of large language models (LLMs) require factual reliability and computational efficiency, yet current systems often hallucinate details and depend on prohibitively large models. We propose FISCAL (Financial Synthetic Claim-Document Augmented Learning), a modular framework for generating synthetic data tailored to financial fact-checking. Using FISCAL, we generate a dataset called FISCAL-data and use it to train MiniCheck-FISCAL, a lightweight verifier for numerical financial claims. MiniCheck-FISCAL outperforms its baseline, surpasses GPT-3.5 Turbo and other open-source peers of similar size, and approaches the accuracy of much larger systems (20x), such as Mixtral-8x22B and Command R+. On external datasets FinDVer and Fin-Fact, it rivals GPT-4o and Claude-3.5 while outperforming Gemini-1.5 Flash. These results show that domain-specific synthetic data, combined with efficient fine-tuning, enables compact models to achieve state-of-the-art accuracy, robustness, and scalability for practical financial AI. The dataset and scripts are available in the project repository (link provided in the paper).
SEMay 4
ARISE: A Repository-level Graph Representation and Toolset for Agentic Fault Localization and Program RepairShahd Seddik, Fatemeh Fard
Repository-level fault localization (FL) and automated program repair (APR) require an agent to identify the relevant code units across files, follow call and data dependencies, and generate a valid patch. Existing graph-based systems provide structural representations of repositories (files, classes, functions and their relationships) but do not model how variable values flow within procedures, leaving agents without the semantic precision needed for function- and line-level localization. We present ARISE (Agentic Repository-level Issue Solving Engine), which augments an LLM-based agent with a multi-granularity program graph that extends structural relationships down to statement-level nodes connected by intra-procedural definition-use edges. ARISE exposes this graph through a three-tier tool API, which brings data-flow slicing as a first-class, queryable agent primitive that allows the model to trace, in a single call, which statements define or consume a variable of interest. We evaluate on SWE-bench Lite (300 real GitHub issues, 11 Python repositories) using Qwen2.5-Coder-32B-Instruct as the backbone. Compared to the unmodified SWE-agent baseline, ARISE improves Function Recall@1 by 17.0 points and Line Recall@1 by 15.0 points. These localization gains translate directly into repair success, with ARISE achieving 22.0% Pass@1 (66/300), a 4.7 percentage-point improvement over SWE-agent. Controlled ablations confirm that the improvement is driven by the data-flow graph rather than the tool schema, and that large code models consume structured slice output directly without requiring a natural-language summarization layer. The graph builder and slicing API are designed as a framework-agnostic, drop-in toolset for future APR research.
SENov 3, 2025
Analysis of AdvFusion: Adapter-based Multilingual Learning for Code Large Language ModelsAmirreza Esmaeili, Fahd Seddik, Yongyi Ji et al.
Programming languages can benefit from one another by utilizing a language model for software engineering tasks. Full fine-tuning and Parameter Efficient Fine-Tuning (PEFT) of Code Language Models (Code-LMs) has been explored for multilingual knowledge transfer. AdapterFusion is a PEFT architecture that aims to enhance task performance by leveraging information from multiple programming languages, but primarily focuses on the target programming language. In our previous work, we proposed AdvFusion, a novel PEFT-based approach that effectively learns from other programming languages before adapting to the target task. Though previous experiments showed that AdvFusion outperformed AdapterFusion and LoRA, it was applied on pre-trained Code-LMs and was limited to only two tasks, code summarization and method name prediction. In this study, we expanded our work and investigated AdvFusion on Code Large Language Models (Code-LLMs), considering three new tasks: code generation, code translation, and commit message generation. We observed that different Code-LLMs/tasks exhibit different characteristics. In code generation, AdvFusion outperformed AdapterFusion but not other PEFT methods (LoRA, Compacter, and TaskAdapter). In commit message generation, AdapterFusion performed better than AdvFusion, and contrary to code generation, we found that the other PEFT methods do not have better performance. In code translation, AdvFusion performed worse than AdapterFusion overall, with the performance gap marginally widening as the model size increases. However, consistent with code generation, other PEFT methods showed better performance.
SENov 2, 2025
GrowthHacker: Automated Off-Policy Evaluation Optimization Using Code-Modifying LLM AgentsJie JW Wu, Ayanda Patrick Herlihy, Ahmad Saleem Mirza et al.
With the software industry shifting toward a data-driven culture, online A/B testing is a key tool for evaluating new technologies. However, deploying such experiments requires substantial resources, may negatively impact users, and involves long data collection periods. To address this, \textit{off-policy evaluation (OPE)}, or offline A/B testing, uses logged data to assess technologies and is fundamental in Reinforcement Learning, making it crucial in domains where online testing is costly or risky, such as healthcare, recommender systems, education, dialog systems, and robotics. Despite advances in coding LLMs and agentic AI, little is known about leveraging them to optimize OPE results. We investigate whether LLMs and LLM-based agents can improve OPE performance via code optimization. We propose \textit{GrowthHacker}, a benchmark with agent and baseline methods on large-scale real-world datasets, which iteratively optimizes code, evaluates results, and begins new optimization cycles. We collected datasets, established protocols, implemented baselines for OPE on the Open Bandit Pipeline (OBP)~\cite{saito2021openbanditdatasetpipeline} and Scope-RL~\cite{kiyohara2023scope}, and developed the \textit{two_agent} framework, which reduces system complexity while preserving optimization effectiveness. Results show the two_agent framework achieves 100% reliability and the highest average improvement of 106.7% among positive outcomes. Both two_agent and CrewAI reach 45% success rates, outperforming AutoGen's 34%. These findings demonstrate the feasibility of LLM-based agents as automated "growth hackers" to enhance OPE systems, with implications for scaling data-driven decision-making in production.
SEApr 18
Bias in the Loop: Auditing LLM-as-a-Judge for Software EngineeringZixiao Zhao, Amirreza Esmaeili, Fatemeh Fard
Large Language Models are increasingly used as judges to evaluate code artifacts when exhaustive human review or executable test coverage is unavailable. LLM-judge is increasingly relevant in agentic software engineering workflows, where it can help rank candidate solutions and guide patch selection. While attractive for scale, current practice lacks a principled account of reliability and bias: repeated evaluations of the same case can disagree; small prompt edits can swing outcomes; and seemingly semantics-preserving, human-equivalent perturbations may elicit divergent verdicts. This paper studies LLM-as-a-Judge for code through a measurement-first lens. We analyze two pointwise judging regimes across code generation, code repair task, and test generation, and we systematically probe prompt-induced biases. Our study considers difficulty levels for repeated runs and controlled prompt interventions that isolate one presentation cue at a time, and it evaluates judges using consistency and sensitivity to bias. We find that judge decisions are highly sensitive to prompt biases even when the underlying code snippet is unchanged. Across all three tasks, several biases systematically shift preferences toward the option favored by the prompt, improving accuracy when that option aligns with the gold answer but substantially reducing it otherwise. In some settings, these effects are large enough to change task-level conclusions and alter relative model rankings. These findings show that reported judge performance may reflect prompt artifacts rather than stable assessment ability, posing a direct threat to the validity and reproducibility of code evaluation. We therefore argue that LLM-as-a-Judge studies should report bias sensitivity alongside accuracy and incorporate explicit controls to support more trustworthy model comparison in software engineering.
SENov 14, 2021
FACOS: Finding API Relevant Contents on Stack Overflow with Semantic and Syntactic AnalysisKien Luong, Mohammad Hadi, Ferdian Thung et al.
Collecting API examples, usages, and mentions relevant to a specific API method over discussions on venues such as Stack Overflow is not a trivial problem. It requires efforts to correctly recognize whether the discussion refers to the API method that developers/tools are searching for. The content of the thread, which consists of both text paragraphs describing the involvement of the API method in the discussion and the code snippets containing the API invocation, may refer to the given API method. Leveraging this observation, we develop FACOS, a context-specific algorithm to capture the semantic and syntactic information of the paragraphs and code snippets in a discussion. FACOS combines a syntactic word-based score with a score from a predictive model fine-tuned from CodeBERT. FACOS beats the state-of-the-art approach by 13.9% in terms of F1-score.
SENov 12, 2020
A Fine-grained Data Set and Analysis of Tangling in Bug Fixing CommitsSteffen Herbold, Alexander Trautsch, Benjamin Ledel et al.
Context: Tangled commits are changes to software that address multiple concerns at once. For researchers interested in bugs, tangled commits mean that they actually study not only bugs, but also other concerns irrelevant for the study of bugs. Objective: We want to improve our understanding of the prevalence of tangling and the types of changes that are tangled within bug fixing commits. Methods: We use a crowd sourcing approach for manual labeling to validate which changes contribute to bug fixes for each line in bug fixing commits. Each line is labeled by four participants. If at least three participants agree on the same label, we have consensus. Results: We estimate that between 17% and 32% of all changes in bug fixing commits modify the source code to fix the underlying problem. However, when we only consider changes to the production code files this ratio increases to 66% to 87%. We find that about 11% of lines are hard to label leading to active disagreements between participants. Due to confirmed tangling and the uncertainty in our data, we estimate that 3% to 47% of data is noisy without manual untangling, depending on the use case. Conclusion: Tangled commits have a high prevalence in bug fixes and can lead to a large amount of noise in the data. Prior research indicates that this noise may alter results. As researchers, we should be skeptics and assume that unvalidated data is likely very noisy, until proven otherwise.