SEMay 28, 2022
Syntax-Guided Program Reduction for Understanding Neural Code Intelligence ModelsMd Rafiqul Islam Rabin, Aftab Hussain, Mohammad Amin Alipour
Neural code intelligence (CI) models are opaque black-boxes and offer little insight on the features they use in making predictions. This opacity may lead to distrust in their prediction and hamper their wider adoption in safety-critical applications. Recently, input program reduction techniques have been proposed to identify key features in the input programs to improve the transparency of CI models. However, this approach is syntax-unaware and does not consider the grammar of the programming language. In this paper, we apply a syntax-guided program reduction technique that considers the grammar of the input programs during reduction. Our experiments on multiple models across different types of input programs show that the syntax-guided program reduction technique is faster and provides smaller sets of key tokens in reduced programs. We also show that the key tokens could be used in generating adversarial examples for up to 65% of the input programs.
SEJul 6, 2024
Harnessing the Power of LLMs: Automating Unit Test Generation for High-Performance ComputingRabimba Karanjai, Aftab Hussain, Md Rafiqul Islam Rabin et al.
Unit testing is crucial in software engineering for ensuring quality. However, it's not widely used in parallel and high-performance computing software, particularly scientific applications, due to their smaller, diverse user base and complex logic. These factors make unit testing challenging and expensive, as it requires specialized knowledge and existing automated tools are often ineffective. To address this, we propose an automated method for generating unit tests for such software, considering their unique features like complex logic and parallel processing. Recently, large language models (LLMs) have shown promise in coding and testing. We explored the capabilities of Davinci (text-davinci-002) and ChatGPT (gpt-3.5-turbo) in creating unit tests for C++ parallel programs. Our results show that LLMs can generate mostly correct and comprehensive unit tests, although they have some limitations, such as repetitive assertions and blank test cases.
LGMar 8, 2023
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeAftab Hussain, Md Rafiqul Islam Rabin, Bowen Xu et al.
Although deep neural models substantially reduce the overhead of feature engineering, the features readily available in the inputs might significantly impact training cost and the performance of the models. In this paper, we explore the impact of an unsuperivsed feature enrichment approach based on variable roles on the performance of neural models of code. The notion of variable roles (as introduced in the works of Sajaniemi et al. [Refs. 1,2]) has been found to help students' abilities in programming. In this paper, we investigate if this notion would improve the performance of neural models of code. To the best of our knowledge, this is the first work to investigate how Sajaniemi et al.'s concept of variable roles can affect neural models of code. In particular, we enrich a source code dataset by adding the role of individual variables in the dataset programs, and thereby conduct a study on the impact of variable role enrichment in training the Code2Seq model. In addition, we shed light on some challenges and opportunities in feature enrichment for neural code intelligence models.
LGMar 3, 2023
Study of Distractors in Neural Models of CodeMd Rafiqul Islam Rabin, Aftab Hussain, Sahil Suneja et al.
Finding important features that contribute to the prediction of neural models is an active area of research in explainable AI. Neural models are opaque and finding such features sheds light on a better understanding of their predictions. In contrast, in this work, we present an inverse perspective of distractor features: features that cast doubt about the prediction by affecting the model's confidence in its prediction. Understanding distractors provide a complementary view of the features' relevance in the predictions of neural models. In this paper, we apply a reduction-based technique to find distractors and provide our preliminary results of their impacts and types. Our experiments across various tasks, models, and datasets of code reveal that the removal of tokens can have a significant impact on the confidence of models in their predictions and the categories of tokens can also play a vital role in the model's confidence. Our study aims to enhance the transparency of models by emphasizing those tokens that significantly influence the confidence of the models.
SEAug 22, 2024
Unlearning Trojans in Large Language Models: A Comparison Between Natural Language and Source CodeMahdi Kazemi, Aftab Hussain, Md Rafiqul Islam Rabin et al.
This work investigates the application of Machine Unlearning (MU) for mitigating the impact of trojans embedded in conventional large language models of natural language (Text-LLMs) and large language models of code (Code-LLMs) We propose a novel unlearning approach, LYA, that leverages both gradient ascent and elastic weight consolidation, a Fisher Information Matrix (FIM) based regularization technique, to unlearn trojans from poisoned models. We compare the effectiveness of LYA against conventional techniques like fine-tuning, retraining, and vanilla gradient ascent. The subject models we investigate are BERT and CodeBERT, for sentiment analysis and code defect detection tasks, respectively. Our findings demonstrate that the combination of gradient ascent and FIM-based regularization, as done in LYA, outperforms existing methods in removing the trojan's influence from the poisoned model, while preserving its original functionality. To the best of our knowledge, this is the first work that compares and contrasts MU of trojans in LLMs, in the NL and Coding domain.
SEDec 25, 2021Code
FMViz: Visualizing Tests Generated by AFL at the Byte-levelAftab Hussain, Mohammad Amin Alipour
Software fuzzing is a strong testing technique that has become the de facto approach for automated software testing and software vulnerability detection in the industry. The random nature of fuzzing makes monitoring and understanding the behavior of fuzzers difficult. In this paper, we report the development of Fuzzer Mutation Visualizer (FMViz), a tool that focuses on visualizing byte-level mutations in fuzzers. In particular, FMViz extends American Fuzzy Lop (AFL) to visualize the generated test inputs and highlight changes between consecutively generated seeds as a fuzzing campaign progresses. The overarching goal of our tool is to help developers and students comprehend the inner-workings of the AFL fuzzer better. In this paper, we present the architecture of FMViz, discuss a sample case study of it, and outline the future work. FMViz is open-source and publicly available at https://github.com/AftabHussain/afl-test-viz.
HCApr 25, 2025
From Prompts to Propositions: A Logic-Based Lens on Student-LLM InteractionsAli Alfageeh, Sadegh AlMahdi Kazemi Zarkouei, Daye Nam et al.
Background and Context. The increasing integration of large language models (LLMs) in computing education presents an emerging challenge in understanding how students use LLMs and craft prompts to solve computational tasks. Prior research has used both qualitative and quantitative methods to analyze prompting behavior, but these approaches lack scalability or fail to effectively capture the semantic evolution of prompts. Objective. In this paper, we investigate whether students prompts can be systematically analyzed using propositional logic constraints. We examine whether this approach can identify patterns in prompt evolution, detect struggling students, and provide insights into effective and ineffective strategies. Method. We introduce Prompt2Constraints, a novel method that translates students prompts into logical constraints. The constraints are able to represent the intent of the prompts in succinct and quantifiable ways. We used this approach to analyze a dataset of 1,872 prompts from 203 students solving introductory programming tasks. Findings. We find that while successful and unsuccessful attempts tend to use a similar number of constraints overall, when students fail, they often modify their prompts more significantly, shifting problem-solving strategies midway. We also identify points where specific interventions could be most helpful to students for refining their prompts. Implications. This work offers a new and scalable way to detect students who struggle in solving natural language programming tasks. This work could be extended to investigate more complex tasks and integrated into programming tools to provide real-time support.
SEMay 5, 2024
Trojans in Large Language Models of Code: A Critical Review through a Trigger-Based TaxonomyAftab Hussain, Md Rafiqul Islam Rabin, Toufique Ahmed et al.
Large language models (LLMs) have provided a lot of exciting new capabilities in software development. However, the opaque nature of these models makes them difficult to reason about and inspect. Their opacity gives rise to potential security risks, as adversaries can train and deploy compromised models to disrupt the software development process in the victims' organization. This work presents an overview of the current state-of-the-art trojan attacks on large language models of code, with a focus on triggers -- the main design point of trojans -- with the aid of a novel unifying trigger taxonomy framework. We also aim to provide a uniform definition of the fundamental concepts in the area of trojans in Code LLMs. Finally, we draw implications of findings on how code models learn on trigger design.
CRFeb 23, 2024
On Trojan Signatures in Large Language Models of CodeAftab Hussain, Md Rafiqul Islam Rabin, Mohammad Amin Alipour
Trojan signatures, as described by Fields et al. (2021), are noticeable differences in the distribution of the trojaned class parameters (weights) and the non-trojaned class parameters of the trojaned model, that can be used to detect the trojaned model. Fields et al. (2021) found trojan signatures in computer vision classification tasks with image models, such as, Resnet, WideResnet, Densenet, and VGG. In this paper, we investigate such signatures in the classifier layer parameters of large language models of source code. Our results suggest that trojan signatures could not generalize to LLMs of code. We found that trojaned code models are stubborn, even when the models were poisoned under more explicit settings (finetuned with pre-trained weights frozen). We analyzed nine trojaned models for two binary classification tasks: clone and defect detection. To the best of our knowledge, this is the first work to examine weight-based trojan signature revelation techniques for large-language models of code and furthermore to demonstrate that detecting trojans only from the weights in such models is a hard problem.
SEDec 25, 2021
DIAR: Removing Uninteresting Bytes from Seeds in Software FuzzingAftab Hussain, Mohammad Amin Alipour
Software fuzzing mutates bytes in the test seeds to explore different behaviors of the program under test. Initial seeds can have great impact on the performance of a fuzzing campaign. Mutating a lot of uninteresting bytes in a large seed wastes the fuzzing resources. In this paper, we present the preliminary results of our approach that aims to improve the performance of fuzzers through identifying and removing uninteresting bytes in the seeds. In particular, we present DIAR, a technique that reduces the size of the seeds based on their coverage. Our preliminary results suggest fuzzing campaigns that start with reduced seeds, find new paths faster, and can produce higher coverage overall.
SENov 1, 2021
Code2Snapshot: Using Code Snapshots for Learning Representations of Source CodeMd Rafiqul Islam Rabin, Mohammad Amin Alipour
There are several approaches for encoding source code in the input vectors of neural models. These approaches attempt to include various syntactic and semantic features of input programs in their encoding. In this paper, we investigate Code2Snapshot, a novel representation of the source code that is based on the snapshots of input programs. We evaluate several variations of this representation and compare its performance with state-of-the-art representations that utilize the rich syntactic and semantic features of input programs. Our preliminary study on the utility of Code2Snapshot in the code summarization and code classification tasks suggests that simple snapshots of input programs have comparable performance to state-of-the-art representations. Interestingly, obscuring input programs have insignificant impacts on the Code2Snapshot performance, suggesting that, for some tasks, neural models may provide high performance by relying merely on the structure of input programs.
LGJun 16, 2021
Memorization and Generalization in Neural Code Intelligence ModelsMd Rafiqul Islam Rabin, Aftab Hussain, Mohammad Amin Alipour et al.
Deep Neural Networks (DNNs) are increasingly being used in software engineering and code intelligence tasks. These are powerful tools that are capable of learning highly generalizable patterns from large datasets through millions of parameters. At the same time, their large capacity can render them prone to memorizing data points. Recent work suggests that the memorization risk manifests especially strongly when the training dataset is noisy, involving many ambiguous or questionable samples, and memorization is the only recourse. The goal of this paper is to evaluate and compare the extent of memorization and generalization in neural code intelligence models. It aims to provide insights on how memorization may impact the learning behavior of neural models in code intelligence systems. To observe the extent of memorization in models, we add random noise to the original training dataset and use various metrics to quantify the impact of noise on various aspects of training and testing. We evaluate several state-of-the-art neural code intelligence models and benchmarks based on Java, Python, and Ruby codebases. Our results highlight important risks: millions of trainable parameters allow the neural networks to memorize anything, including noisy data, and provide a false sense of generalization. We observed all models manifest some forms of memorization. This can be potentially troublesome in most code intelligence tasks where they rely on rather noise-prone and repetitive data sources, such as code from GitHub. To the best of our knowledge, we provide the first study to quantify memorization effects in the domain of software engineering and code intelligence systems. This work raises awareness and provides new insights into important issues of training neural models in code intelligence systems that are usually overlooked by software engineering researchers.
SEJun 7, 2021
Understanding Neural Code Intelligence Through Program SimplificationMd Rafiqul Islam Rabin, Vincent J. Hellendoorn, Mohammad Amin Alipour
A wide range of code intelligence (CI) tools, powered by deep neural networks, have been developed recently to improve programming productivity and perform program analysis. To reliably use such tools, developers often need to reason about the behavior of the underlying models and the factors that affect them. This is especially challenging for tools backed by deep neural networks. Various methods have tried to reduce this opacity in the vein of "transparent/interpretable-AI". However, these approaches are often specific to a particular set of network architectures, even requiring access to the network's parameters. This makes them difficult to use for the average programmer, which hinders the reliable adoption of neural CI systems. In this paper, we propose a simple, model-agnostic approach to identify critical input features for models in CI systems, by drawing on software debugging research, specifically delta debugging. Our approach, SIVAND, uses simplification techniques that reduce the size of input programs of a CI model while preserving the predictions of the model. We show that this approach yields remarkably small outputs and is broadly applicable across many model architectures and problem domains. We find that the models in our experiments often rely heavily on just a few syntactic features in input programs. We believe that SIVAND's extracted features may help understand neural CI systems' predictions and learned behavior.
SEDec 19, 2020
Configuring Test Generators using Bug Reports: A Case Study of GCC Compiler and CsmithMd Rafiqul Islam Rabin, Mohammad Amin Alipour
The correctness of compilers is instrumental in the safety and reliability of other software systems, as bugs in compilers can produce executables that do not reflect the intent of programmers. Such errors are difficult to identify and debug. Random test program generators are commonly used in testing compilers, and they have been effective in uncovering bugs. However, the problem of guiding these test generators to produce test programs that are more likely to find bugs remains challenging. In this paper, we use the code snippets in the bug reports to guide the test generation. The main idea of this work is to extract insights from the bug reports about the language features that are more prone to inadequate implementation and using the insights to guide the test generators. We use the GCC C compiler to evaluate the effectiveness of this approach. In particular, we first cluster the test programs in the GCC bugs reports based on their features. We then use the centroids of the clusters to compute configurations for Csmith, a popular test generator for C compilers. We evaluated this approach on eight versions of GCC and found that our approach provides higher coverage and triggers more miscompilation failures than the state-of-the-art test generation techniques for GCC.
LGAug 29, 2020
Towards Demystifying Dimensions of Source Code EmbeddingsMd Rafiqul Islam Rabin, Arjun Mukherjee, Omprakash Gnawali et al.
Source code representations are key in applying machine learning techniques for processing and analyzing programs. A popular approach in representing source code is neural source code embeddings that represents programs with high-dimensional vectors computed by training deep neural networks on a large volume of programs. Although successful, there is little known about the contents of these vectors and their characteristics. In this paper, we present our preliminary results towards better understanding the contents of code2vec neural source code embeddings. In particular, in a small case study, we use the code2vec embeddings to create binary SVM classifiers and compare their performance with the handcrafted features. Our results suggest that the handcrafted features can perform very close to the highly-dimensional code2vec embeddings, and the information gains are more evenly distributed in the code2vec embeddings compared to the handcrafted features. We also find that the code2vec embeddings are more resilient to the removal of dimensions with low information gains than the handcrafted features. We hope our results serve a stepping stone toward principled analysis and evaluation of these code representations.
SEJul 31, 2020
On the Generalizability of Neural Program Models with respect to Semantic-Preserving Program TransformationsMd Rafiqul Islam Rabin, Nghi D. Q. Bui, Ke Wang et al.
With the prevalence of publicly available source code repositories to train deep neural network models, neural program models can do well in source code analysis tasks such as predicting method names in given programs that cannot be easily done by traditional program analysis techniques. Although such neural program models have been tested on various existing datasets, the extent to which they generalize to unforeseen source code is largely unknown. Since it is very challenging to test neural program models on all unforeseen programs, in this paper, we propose to evaluate the generalizability of neural program models with respect to semantic-preserving transformations: a generalizable neural program model should perform equally well on programs that are of the same semantics but of different lexical appearances and syntactical structures. We compare the results of various neural program models for the method name prediction task on programs before and after automated semantic-preserving transformations. We use three Java datasets of different sizes and three state-of-the-art neural network models for code, namely code2vec, code2seq, and GGNN, to build nine such neural program models for evaluation. Our results show that even with small semantically preserving changes to the programs, these neural program models often fail to generalize their performance. Our results also suggest that neural program models based on data and control dependencies in programs generalize better than neural program models based only on abstract syntax trees. On the positive side, we observe that as the size of the training dataset grows and diversifies the generalizability of correct predictions produced by the neural program models can be improved too. Our results on the generalizability of neural program models provide insights to measure their limitations and provide a stepping stone for their improvement.
SEApr 15, 2020
Evaluation of Generalizability of Neural Program Analyzers under Semantic-Preserving TransformationsMd Rafiqul Islam Rabin, Mohammad Amin Alipour
The abundance of publicly available source code repositories, in conjunction with the advances in neural networks, has enabled data-driven approaches to program analysis. These approaches, called neural program analyzers, use neural networks to extract patterns in the programs for tasks ranging from development productivity to program reasoning. Despite the growing popularity of neural program analyzers, the extent to which their results are generalizable is unknown. In this paper, we perform a large-scale evaluation of the generalizability of two popular neural program analyzers using seven semantically-equivalent transformations of programs. Our results caution that in many cases the neural program analyzers fail to generalize well, sometimes to programs with negligible textual differences. The results provide the initial stepping stones for quantifying robustness in neural program analyzers.
SEAug 27, 2019
K-CONFIG: Using Failing Test Cases to Generate Test Cases in GCC CompilersMd Rafiqul Islam Rabin, Mohammad Amin Alipour
The correctness of compilers is instrumental in the safety and reliability of other software systems, as bugs in compilers can produce programs that do not reflect the intents of programmers. Compilers are complex software systems due to the complexity of optimization. GCC is an optimizing C compiler that has been used in building operating systems and many other system software. In this paper, we describe K-CONFIG, an approach that uses the bugs reported in the GCC repository to generate new test inputs. Our main insight is that the features appearing in the bug reports are likely to reappear in the future bugs, as the bugfixes can be incomplete or those features may be inherently challenging to implement hence more prone to errors. Our approach first clusters the failing test input extracted from the bug reports into clusters of similar test inputs. It then uses these clusters to create configurations for Csmith, the most popular test generator for C compilers. In our experiments on two versions of GCC, our approach could trigger up to 36 miscompilation failures, and 179 crashes, while Csmith with the default configuration did not trigger any failures. This work signifies the benefits of analyzing and using the reported bugs in the generation of new test inputs.
LGAug 25, 2019
Testing Neural Program AnalyzersMd Rafiqul Islam Rabin, Ke Wang, Mohammad Amin Alipour
Deep neural networks have been increasingly used in software engineering and program analysis tasks. They usually take a program and make some predictions about it, e.g., bug prediction. We call these models neural program analyzers. The reliability of neural programs can impact the reliability of the encompassing analyses. In this paper, we describe our ongoing efforts to develop effective techniques for testing neural programs. We discuss the challenges involved in developing such tools and our future plans. In our preliminary experiment on a neural model recently proposed in the literature, we found that the model is very brittle, and simple perturbations in the input can cause the model to make mistakes in its prediction.
HCFeb 18, 2019
Topics of Concern: Identifying User Issues in Reviews of IoT Apps and DevicesAndrew Truelove, Farah Naz Chowdhury, Omprakash Gnawali et al.
Internet of Things (IoT) systems are bundles of networked sensors and actuators that are deployed in an environment and act upon the sensory data that they receive. These systems, especially consumer electronics, have two main cooperating components: a device and a mobile app. The unique combination of hardware and software in IoT systems presents challenges that are lesser known to mainstream software developers. They might require innovative solutions to support the development and integration of such systems. In this paper, we analyze more than 90,000 reviews of ten IoT devices and their corresponding apps and extract the issues that users encountered while using these systems. Our results indicate that issues with connectivity, timing, and updates are particularly prevalent in the reviews. Our results call for a new software-hardware development framework to assist the development of reliable IoT systems.
HCFeb 17, 2019
An Automated Testing Framework for Conversational AgentsSoodeh Atefi, Mohammad Amin Alipour
Conversational agents are systems with a conversational interface that afford interaction in spoken language. These systems are becoming prevalent and are preferred in various contexts and for many users. Despite their increasing success, the automated testing infrastructure to support the effective and efficient development of such systems compared to traditional software systems is still limited. Automated testing framework for conversational systems can improve the quality of these systems by assisting developers to write, execute, and maintain test cases. In this paper, we introduce our work-in-progress automated testing framework, and its realization in the Python programming language. We discuss some research problems in the development of such an automated testing framework for conversational agents. In particular, we point out the problems of the specification of the expected behavior, known as test oracles, and semantic comparison of utterances.
SENov 4, 2016
Data Poisoning: Lightweight Soft Fault Injection for PythonMohammad Amin Alipour, Alex Groce
This paper introduces and explores the idea of data poisoning, a light-weight peer-architecture technique to inject faults into Python programs. This method requires very small modification to the original program, which facilitates evaluation of sensitivity of systems that are prototyped or modeled in Python. We propose different fault scenarios that can be injected to programs using data poisoning. We use Dijkstra's Self Stabilizing Ring Algorithm to illustrate the approach.
SESep 20, 2016
Finding Model-Checkable Needles in Large Source Code Haystacks: Modular Bug-Finding via Static Analysis and Dynamic Invariant DiscoveryMohammad Amin Alipour, Alex Groce, Chaoqiang Zhang et al.
In this paper, we present a novel marriage of static and dynamic analysis. Given a large code base with many functions and a mature test suite, we propose using static analysis to find functions 1) with assertions or other evident correctness properties (e.g., array bounds requirements or pointer access) and 2) with simple enough control flow and data use to be amenable to predicate-abstraction based or bounded model checking without human intervention. Because most such functions in realistic software systems in fact rely on many input preconditions not specified by the language's type system (or annotated in any way), we propose using dynamically discovered invariants based on a program's test suite to characterize likely preconditions, in order to reduce the problem of false positives. While providing little in the way of verification, this approach may provide an additional quick and highly scalable bug-finding method for programs that are usually considered "too large to model check." We present a simple example showing that the technique can be useful for a more typically "model-checkable" code base, even in the presence of a poorly designed test suite and bad invariants.
SESep 20, 2016
Bounded Model Checking and Feature Omission DiversityMohammad Amin Alipour, Alex Groce
In this paper we introduce a novel way to speed up the discovery of counterexamples in bounded model checking, based on parallel runs over versions of a system in which features have been randomly disabled. As shown in previous work, adding constraints to a bounded model checking problem can reduce the size of the verification problem and dramatically decrease the time required to find counterexample. Adapting a technique developed in software testing to this problem provides a simple way to produce useful partial verification problems, with a resulting decrease in average time until a counterexample is produced. If no counterexample is found, partial verification results can also be useful in practice.