SEMar 28, 2023
On Codex Prompt Engineering for OCL Generation: An Empirical StudySeif Abukhalaf, Mohammad Hamdaqa, Foutse Khomh
The Object Constraint Language (OCL) is a declarative language that adds constraints and object query expressions to MOF models. Despite its potential to provide precision and conciseness to UML models, the unfamiliar syntax of OCL has hindered its adoption. Recent advancements in LLMs, such as GPT-3, have shown their capability in many NLP tasks, including semantic parsing and text generation. Codex, a GPT-3 descendant, has been fine-tuned on publicly available code from GitHub and can generate code in many programming languages. We investigate the reliability of OCL constraints generated by Codex from natural language specifications. To achieve this, we compiled a dataset of 15 UML models and 168 specifications and crafted a prompt template with slots to populate with UML information and the target task, using both zero- and few-shot learning methods. By measuring the syntactic validity and execution accuracy metrics of the generated OCL constraints, we found that enriching the prompts with UML information and enabling few-shot learning increases the reliability of the generated OCL constraints. Furthermore, the results reveal a close similarity based on sentence embedding between the generated OCL constraints and the human-written ones in the ground truth, implying a level of clarity and understandability in the generated OCL constraints by Codex.
27.9SEMar 15Code
Mining the YARA Ecosystem: From Ad-Hoc Sharing to Data-Driven Threat IntelligenceDectot--Le Monnier de Gouville Esteban, Mohammad Hamdaqa, Moataz Chouchen
YARA has established itself as the de facto standard for "Detection as Code," enabling analysts and DevSecOps practitioners to define signatures for malware identification across the software supply chain. Despite its pervasive use, the open-source YARA ecosystem remains characterized by ad-hoc sharing and opaque quality. Practitioners currently rely on public repositories without empirical evidence regarding the ecosystem's structural characteristics, maintenance and diffusion dynamics, or operational reliability. We conducted a large-scale mixed-method study of 8.4 million rules mined from 1,853 GitHub repositories. Our pipeline integrates repository mining to map supply chain dynamics, static analysis to assess syntactic quality, and dynamic benchmarking against 4,026 malware and 2,000 goodware samples to measure operational effectiveness. We reveal a highly centralized structure where 10 authors drive 80% of rule adoption. The ecosystem functions as a "static supply chain": repositories show a median inactivity of 782 days and a median technical lag of 4.2 years. While static quality scores appear high (mean = 99.4/100), operational benchmarking uncovers significant noise (false positives) and low recall. Furthermore, coverage is heavily biased toward legacy threats (Ransomware), leaving modern initial access vectors (Loaders, Stealers) severely underrepresented. These findings expose a systemic "double penalty": defenders incur high performance overhead for decayed intelligence. We argue that public repositories function as raw data dumps rather than curated feeds, necessitating a paradigm shift from ad-hoc collection to rigorous rule engineering. We release our dataset and pipeline to support future data-driven curation tools.
32.6SEMar 22
Dynasto: Validity-Aware Dynamic-Static Parameter Optimization for Autonomous Driving TestingDmytro Humeniuk, Mohammad Hamdaqa, Houssem Ben Braiek et al.
Extensive simulation-based testing is important for assuring the safety of autonomous driving systems (ADS). However, generating safety-critical traffic scenarios remains challenging because failures often arise from rare, complex interactions with surrounding vehicles. Existing automatic scenario-generation approaches frequently fail to distinguish genuine ADS faults from collisions caused by implausible or invalid adversarial behaviors, and they typically optimize either scenario initialization or agent behavior in isolation. We propose Dynasto, a two-step testing approach that jointly optimizes initial scenario parameters and dynamic adversarial behaviors to uncover realistic safety-critical failures. First, we train an adversarial agent using reinforcement learning (RL) with temporal-logic-based validity criteria and a safe-distance model inspired by ISO 34502 to promote behaviorally plausible failures. Second, a genetic algorithm (GA) searches over initial conditions while replaying the adversary's failure-inducing behaviors to reveal additional failures that the RL agent alone does not uncover. Finally, a graph-based clustering pipeline groups failures into representative modes based on semantic event sequences. Our evaluation experiments in HighwayEnv across two ADS controllers show that Dynasto finds 60%-70% more valid failures than an RL-only adversary under the same evaluation budget. With clustering, we obtain about 12 interpretable failure modes per system under test, revealing valid failures driven by weaknesses in ego-controller behavior. These results indicate that coordinated dynamic-static optimization with explicit validity constraints is effective for exposing safety-relevant failures in ADS testing.
95.2LGMay 8Code
Rotation-Preserving Supervised Fine-TuningHangzhan Jin, Tianwei Ni, Lu Li et al.
Supervised fine-tuning (SFT) improves in-domain performance but can degrade out-of-domain (OOD) generalization. Prior work suggests that this degradation is related to changes in dominant singular subspaces of pretrained weight matrices. However, directly identifying loss-sensitive directions with Hessian or Fisher information is computationally expensive at LLM scale. In this work, we propose preserving projected rotations in pretrained singular subspaces as an efficient proxy for Fisher-sensitive directions, which we call Rotation-Preserving Supervised Fine-Tuning (RPSFT). RPSFT penalizes changes in the projected top-$k$ singular-vector block of each pretrained weight matrix, limiting unnecessary rotation while preserving task adaptation. Across model families and sizes trained on math reasoning data, RPSFT improves the in-domain/OOD trade-off over standard SFT and strong SFT baselines, better preserves pretrained representations, and provides stronger initializations for downstream RL fine-tuning. Code is available at \href{https://github.com/jinhangzhan/RPSFT.git}{https://github.com/jinhangzhan/RPSFT}.
LGSep 8, 2025Code
RL Fine-Tuning Heals OOD Forgetting in SFTHangzhan Jin, Sitao Luan, Sicheng Lyu et al.
The two-stage fine-tuning paradigm of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has empirically shown better reasoning performance than one-stage SFT for the post-training of Large Language Models (LLMs). However, the evolution and mechanism behind the synergy of SFT and RL are still under-explored and inconclusive. In our study, we find the well-known claim "SFT memorizes, RL generalizes" is over-simplified, and discover that: (1) OOD performance peaks at the early stage of SFT and then declines (OOD forgetting), the best SFT checkpoint cannot be captured by training/test loss; (2) the subsequent RL stage does not generate fundamentally better OOD capability, instead it plays an \textbf{OOD restoration} role, recovering the lost reasoning ability during SFT; (3) The recovery ability has boundaries, \ie{} \textbf{if SFT trains for too short or too long, RL cannot recover the lost OOD ability;} (4) To uncover the underlying mechanisms behind the forgetting and restoration process, we employ SVD analysis on parameter matrices, manually edit them, and observe their impacts on model performance. Unlike the common belief that the shift of model capacity mainly results from the changes of singular values, we find that they are actually quite stable throughout fine-tuning. Instead, the OOD behavior strongly correlates with the \textbf{rotation of singular vectors}. Our findings re-identify the roles of SFT and RL in the two-stage fine-tuning and discover the rotation of singular vectors as the key mechanism. %reversing the rotations induced by SFT, which shows recovery from forgetting, whereas imposing the SFT parameter directions onto a RL-tuned model results in performance degradation. Code is available at https://github.com/xiaodanguoguo/RL_Heals_SFT
9.9SEApr 28
PLMGH: What Matters in PLM-GNN Hybrids for Code Classification and Vulnerability DetectionMohamed Taoufik Kaouthar El Idrissi, Edward Zulkoski, Mohammad Hamdaqa
Code understanding models increasingly rely on pretrained language models (PLMs) and graph neural networks (GNNs), which capture complementary semantic and structural information. We conduct a controlled empirical study of PLM-GNN hybrids for code classification and vulnerability detection tasks by systematically pairing three code-specialized PLMs with three foundational GNN architectures. We compare these hybrids against PLM-only and GNN-only baselines on Java250 and Devign, including an identifier-obfuscation setting. Across both tasks, hybrids consistently outperform GNN-only baselines and often improve ranking quality over frozen PLMs. On Devign, performance and robustness are more sensitive to the PLM feature source than to the GNN backbone. We also find that larger PLMs are not necessarily better feature extractors in this pipeline, and that the PLM choice has more impact than the GNN choice. Finally, we distill these findings into practical guidelines for PLM-GNN design choices in code classification and vulnerability detection.
SEMay 21, 2024
PathOCL: Path-Based Prompt Augmentation for OCL Generation with GPT-4Seif Abukhalaf, Mohammad Hamdaqa, Foutse Khomh
The rapid progress of AI-powered programming assistants, such as GitHub Copilot, has facilitated the development of software applications. These assistants rely on large language models (LLMs), which are foundation models (FMs) that support a wide range of tasks related to understanding and generating language. LLMs have demonstrated their ability to express UML model specifications using formal languages like the Object Constraint Language (OCL). However, the context size of the prompt is limited by the number of tokens an LLM can process. This limitation becomes significant as the size of UML class models increases. In this study, we introduce PathOCL, a novel path-based prompt augmentation technique designed to facilitate OCL generation. PathOCL addresses the limitations of LLMs, specifically their token processing limit and the challenges posed by large UML class models. PathOCL is based on the concept of chunking, which selectively augments the prompts with a subset of UML classes relevant to the English specification. Our findings demonstrate that PathOCL, compared to augmenting the complete UML class model (UML-Augmentation), generates a higher number of valid and correct OCL constraints using the GPT-4 model. Moreover, the average prompt size crafted using PathOCL significantly decreases when scaling the size of the UML class models.
LGAug 22, 2025
RL Is Neither a Panacea Nor a Mirage: Understanding Supervised vs. Reinforcement Learning Fine-Tuning for LLMsHangzhan Jin, Sicheng Lv, Sifan Wu et al.
Training large language models (LLMs) from scratch is increasingly impractical, making post-training methods such as supervised fine-tuning (SFT) and reinforcement-learning fine-tuning (RL-FT, e.g., PPO) central to modern practice. Using an out-of-distribution (OOD) variant of the 24-point card game and new spectrum-based diagnostics, we revisit how these two stages reshape model representation and OOD performance. Our key findings are- (1) RL-FT can restore much of the OOD performance loss from SFT (e.g., Llama-11B 8.97% to 15.38%, Qwen-7B 17.09% to 19.66%). But when SFT induces severe overfitting and a clear distribution shift, RL-FT cannot fully recover OOD performance. (2) Direction shifts of singular vectors matter more than singular value magnitudes. These shifts concentrate on directions linked to the largest and smallest singular values, leaving the bulk spectrum intact. (3) Low-rank and shallow recovery is effective: restoring singular vector directions for the top 20% of values or first 25% of layers recovers 70-80% of OOD performance. (4) Stronger SFT checkpoints enable better recovery by RL, while overfitted ones resist restoration. These results reconcile prior reports of RL superior OOD performance: RL primarily counteracts SFT-induced directional drift rather than finding new solutions. Our spectrum-aware analysis highlights inexpensive recovery knobs low-rank UV merging and shallow-layer resets that practitioners can use before costly RL fine-tuning.
SEMay 8, 2025
PRIMG : Efficient LLM-driven Test Generation Using Mutant PrioritizationMohamed Salah Bouafif, Mohammad Hamdaqa, Edward Zulkoski
Mutation testing is a widely recognized technique for assessing and enhancing the effectiveness of software test suites by introducing deliberate code mutations. However, its application often results in overly large test suites, as developers generate numerous tests to kill specific mutants, increasing computational overhead. This paper introduces PRIMG (Prioritization and Refinement Integrated Mutation-driven Generation), a novel framework for incremental and adaptive test case generation for Solidity smart contracts. PRIMG integrates two core components: a mutation prioritization module, which employs a machine learning model trained on mutant subsumption graphs to predict the usefulness of surviving mutants, and a test case generation module, which utilizes Large Language Models (LLMs) to generate and iteratively refine test cases to achieve syntactic and behavioral correctness. We evaluated PRIMG on real-world Solidity projects from Code4Arena to assess its effectiveness in improving mutation scores and generating high-quality test cases. The experimental results demonstrate that PRIMG significantly reduces test suite size while maintaining high mutation coverage. The prioritization module consistently outperformed random mutant selection, enabling the generation of high-impact tests with reduced computational effort. Furthermore, the refining process enhanced the correctness and utility of LLM-generated tests, addressing their inherent limitations in handling edge cases and complex program logic.
SEMar 29, 2025
CCCI: Code Completion with Contextual Information for Complex Data Transfer Tasks Using Large Language ModelsHangzhan Jin, Mohammad Hamdaqa
Unlike code generation, which involves creating code from scratch, code completion focuses on integrating new lines or blocks of code into an existing codebase. This process requires a deep understanding of the surrounding context, such as variable scope, object models, API calls, and database relations, to produce accurate results. These complex contextual dependencies make code completion a particularly challenging problem. Current models and approaches often fail to effectively incorporate such context, leading to inaccurate completions with low acceptance rates (around 30\%). For tasks like data transfer, which rely heavily on specific relationships and data structures, acceptance rates drop even further. This study introduces CCCI, a novel method for generating context-aware code completions specifically designed to address data transfer tasks. By integrating contextual information, such as database table relationships, object models, and library details into Large Language Models (LLMs), CCCI improves the accuracy of code completions. We evaluate CCCI using 289 Java snippets, extracted from over 819 operational scripts in an industrial setting. The results demonstrate that CCCI achieved a 49.1\% Build Pass rate and a 41.0\% CodeBLEU score, comparable to state-of-the-art methods that often struggle with complex task completion.
38.6AIApr 1
Adversarial Moral Stress Testing of Large Language ModelsSaeid Jamshidi, Foutse Khomh, Arghavan Moradi Dakhel et al.
Evaluating the ethical robustness of large language models (LLMs) deployed in software systems remains challenging, particularly under sustained adversarial user interaction. Existing safety benchmarks typically rely on single-round evaluations and aggregate metrics, such as toxicity scores and refusal rates, which offer limited visibility into behavioral instability that may arise during realistic multi-turn interactions. As a result, rare but high-impact ethical failures and progressive degradation effects may remain undetected prior to deployment. This paper introduces Adversarial Moral Stress Testing (AMST), a stress-based evaluation framework for assessing ethical robustness under adversarial multi-round interactions. AMST applies structured stress transformations to prompts and evaluates model behavior through distribution-aware robustness metrics that capture variance, tail risk, and temporal behavioral drift across interaction rounds. We evaluate AMST on several state-of-the-art LLMs, including LLaMA-3-8B, GPT-4o, and DeepSeek-v3, using a large set of adversarial scenarios generated under controlled stress conditions. The results demonstrate substantial differences in robustness profiles across models and expose degradation patterns that are not observable under conventional single-round evaluation protocols. In particular, robustness has been shown to depend on distributional stability and tail behavior rather than on average performance alone. Additionally, AMST provides a scalable and model-agnostic stress-testing methodology that enables robustness-aware evaluation and monitoring of LLM-enabled software systems operating in adversarial environments.
40.9CRApr 1
Multi-Agent LLM Governance for Safe Two-Timescale Reinforcement Learning in SDN-IoT DefenseSaeid Jamshidi, Negar Shahabi, Foutse Khomh et al.
Software-Defined Networking (SDN) is increasingly adopted to secure Internet-of-Things (IoT) networks due to its centralized control and programmable forwarding. However, SDN-IoT defense is inherently a closed-loop control problem in which mitigation actions impact controller workload, queue dynamics, rule-installation delay, and future traffic observations. Aggressive mitigation may destabilize the control plane, degrade Quality of Service (QoS), and amplify systemic risk. Existing learning-based approaches prioritize detection accuracy while neglecting controller coupling and short-horizon Reinforcement Learning (RL) optimization without structured, auditable policy evolution. This paper introduces a self-reflective two-timescale SDN-IoT defense solution separating fast mitigation from slow policy governance. At the fast timescale, per-switch Proximal Policy Optimization (PPO) agents perform controller-aware mitigation under safety constraints and action masking. At the slow timescale, a multi-agent Large Language Model (LLM) governance engine generates machine-parsable updates to the global policy constitution Pi, which encodes admissible actions, safety thresholds, and reward priorities. Updates (Delta Pi) are validated through stress testing and deployed only with non-regression and safety guarantees, ensuring an auditable evolution without retraining RL agents. Evaluation under heterogeneous IoT traffic and adversarial stress shows improvements of 9.1% Macro-F1 over PPO and 15.4% over static baselines. Worst-case degradation drops by 36.8%, controller backlog peaks by 42.7%, and RTT p95 inflation remains below 5.8% under high-intensity attacks. Policy evolution converges within five cycles, reducing catastrophic overload from 11.6% to 2.3%.
SEOct 4, 2025
Refactoring with LLMs: Bridging Human Expertise and Machine UnderstandingYonnel Chen Kuang Piao, Jean Carlors Paul, Leuson Da Silva et al.
Code refactoring is a fundamental software engineering practice aimed at improving code quality and maintainability. Despite its importance, developers often neglect refactoring due to the significant time, effort, and resources it requires, as well as the lack of immediate functional rewards. Although several automated refactoring tools have been proposed, they remain limited in supporting a broad spectrum of refactoring types. In this study, we explore whether instruction strategies inspired by human best-practice guidelines can enhance the ability of Large Language Models (LLMs) to perform diverse refactoring tasks automatically. Leveraging the instruction-following and code comprehension capabilities of state-of-the-art LLMs (e.g., GPT-mini and DeepSeek-V3), we draw on Martin Fowler's refactoring guidelines to design multiple instruction strategies that encode motivations, procedural steps, and transformation objectives for 61 well-known refactoring types. We evaluate these strategies on benchmark examples and real-world code snippets from GitHub projects. Our results show that instruction designs grounded in Fowler's guidelines enable LLMs to successfully perform all benchmark refactoring types and preserve program semantics in real-world settings, an essential criterion for effective refactoring. Moreover, while descriptive instructions are more interpretable to humans, our results show that rule-based instructions often lead to better performance in specific scenarios. Interestingly, allowing models to focus on the overall goal of refactoring, rather than prescribing a fixed transformation type, can yield even greater improvements in code quality.
SEDec 21, 2021
Chat2Code: Towards conversational concrete syntax for model specification and code generation, the case of smart contractsIlham Qasse, Shailesh Mishra, Mohammad Hamdaqa
The revolutionary potential of automatic code generation tools based on Model-Driven Engineering (MDE) frameworks has yet to be realized. Beyond their ability to help software professionals write more accurate, reusable code, they could make programming accessible for a whole new class of non-technical users. However, non-technical users have been slow to embrace these tools. This may be because their concrete syntax is often patterned after the operations of textual or graphical interfaces. The interfaces are common, but users would need more extensive, precise and detailed knowledge of them than they can be assumed to have, to use them as concrete syntax. Conversational interfaces (chatbots) offer a much more accessible way for non-technical users to generate code. In this paper, we discuss the basic challenge of integrating conversational agents within Model-Driven Engineering (MDE) frameworks, then turn to look at a specific application: the auto-generation of smart contract code in multiple languages by non-technical users, based on conversational syntax. We demonstrate how this can be done, and evaluate our approach by conducting user experience survey to assess the usability and functionality of the chatbot framework.
SEMar 16, 2021
iContractBot: A Chatbot for Smart Contracts' Specification and Code GenerationIlham Qasse, Shailesh Mishra, Mohammad Hamdaqa
Recently, Blockchain technology adoption has expanded to many application areas due to the evolution of smart contracts. However, developing smart contracts is non-trivial and challenging due to the lack of tools and expertise in this field. A promising solution to overcome this issue is to use Model-Driven Engineering (MDE), however, using models still involves a learning curve and might not be suitable for non-technical users. To tackle this challenge, chatbot or conversational interfaces can be used to assess the non-technical users to specify a smart contract in gradual and interactive manner. In this paper, we propose iContractBot, a chatbot for modeling and developing smart contracts. Moreover, we investigate how to integrate iContractBot with iContractML, a domain-specific modeling language for developing smart contracts, and instantiate intention models from the chatbot. The iContractBot framework provides a domain-specific language (DSL) based on the user intention and performs model-to-text transformation to generate the smart contract code. A smart contract use case is presented to demonstrate how iContractBot can be utilized for creating models and generating the deployment artifacts for smart contracts based on a simple conversation.
SEMar 15, 2021
EnHMM: On the Use of Ensemble HMMs and Stack Traces to Predict the Reassignment of Bug Report FieldsMd Shariful Islam, Abdelwahab Hamou-Lhadj, Korosh K. Sabor et al.
Bug reports (BR) contain vital information that can help triaging teams prioritize and assign bugs to developers who will provide the fixes. However, studies have shown that BR fields often contain incorrect information that need to be reassigned, which delays the bug fixing process. There exist approaches for predicting whether a BR field should be reassigned or not. These studies use mainly BR descriptions and traditional machine learning algorithms (SVM, KNN, etc.). As such, they do not fully benefit from the sequential order of information in BR data, such as function call sequences in BR stack traces, which may be valuable for improving the prediction accuracy. In this paper, we propose a novel approach, called EnHMM, for predicting the reassignment of BR fields using ensemble Hidden Markov Models (HMMs), trained on stack traces. EnHMM leverages the natural ability of HMMs to represent sequential data to model the temporal order of function calls in BR stack traces. When applied to Eclipse and Gnome BR repositories, EnHMM achieves an average precision, recall, and F-measure of 54%, 76%, and 60% on Eclipse dataset and 41%, 69%, and 51% on Gnome dataset. We also found that EnHMM improves over the best single HMM by 36% for Eclipse and 76% for Gnome. Finally, when comparing EnHMM to Im.ML.KNN, a recent approach in the field, we found that the average F-measure score of EnHMM improves the average F-measure of Im.ML.KNN by 6.80% and improves the average recall of Im.ML.KNN by 36.09%. However, the average precision of EnHMM is lower than that of Im.ML.KNN (53.93% as opposed to 56.71%).