Steven H. H. Ding

CR
h-index16
7papers
875citations
Novelty56%
AI Score47

7 Papers

SEJul 20, 2023
Pluvio: Assembly Clone Search for Out-of-domain Architectures and Libraries through Transfer Learning and Conditional Variational Information Bottleneck

Zhiwei Fu, Steven H. H. Ding, Furkan Alaca et al.

The practice of code reuse is crucial in software development for a faster and more efficient development lifecycle. In reality, however, code reuse practices lack proper control, resulting in issues such as vulnerability propagation and intellectual property infringements. Assembly clone search, a critical shift-right defence mechanism, has been effective in identifying vulnerable code resulting from reuse in released executables. Recent studies on assembly clone search demonstrate a trend towards using machine learning-based methods to match assembly code variants produced by different toolchains. However, these methods are limited to what they learn from a small number of toolchain variants used in training, rendering them inapplicable to unseen architectures and their corresponding compilation toolchain variants. This paper presents the first study on the problem of assembly clone search with unseen architectures and libraries. We propose incorporating human common knowledge through large-scale pre-trained natural language models, in the form of transfer learning, into current learning-based approaches for assembly clone search. Transfer learning can aid in addressing the limitations of the existing approaches, as it can bring in broader knowledge from human experts in assembly code. We further address the sequence limit issue by proposing a reinforcement learning agent to remove unnecessary and redundant tokens. Coupled with a new Variational Information Bottleneck learning strategy, the proposed system minimizes the reliance on potential indicators of architectures and optimization settings, for a better generalization of unseen architectures. We simulate the unseen architecture clone search scenarios and the experimental results show the effectiveness of the proposed approach against the state-of-the-art solutions.

37.3SEApr 13
How Developers Adopt, Use, and Evolve CI/CD Caching: An Empirical Study on GitHub Actions

Kazi Amit Hasan, Yuan Tian, Safwat Hassan et al.

Continuous Integration/Continuous Delivery (CI/CD) caching is widely used to reduce repeated computation and improve CI/CD efficiency, yet maintaining effective caching requires ongoing maintenance effort. In this paper, we present the first empirical study on how developers configure and evolve caching in CI/CD workflows on GitHub Actions. We analyze 952 GitHub repositories (266 cache adopters and 686 non-adopters), to compare repository characteristics, characterize caching usage at the job and step levels, uncover patterns in caching configuration evolution, and identify the drivers of cache-related changes. Our analysis spans 1,556 workflow files, 10,373 commits, and 17,185 workflow configuration changes, including an average of 9.37 cache-related changes per repository. Our main observations are: (1) cache-adopting repositories are more active and popular than non-adopters; (2) caching is used across multiple CI/CD job types through a variety of caching mechanisms rather than a single standardized approach; (3) caching configurations evolve through frequent, repetitive maintenance patterns, with rapid updates in build and test jobs and slower evolution in other job types; and (4) cache-related modifications are driven by distinct maintenance needs: parameter updates are mainly human-driven to fix issues, while version updates occur later and are often bot-driven for dependency maintenance. Our findings quantify the substantial maintenance effort involved in CI/CD caching and highlight opportunities to improve reliability and tool support.

CRApr 3, 2024
Dynamic Neural Control Flow Execution: An Agent-Based Deep Equilibrium Approach for Binary Vulnerability Detection

Litao Li, Steven H. H. Ding, Andrew Walenstein et al.

Software vulnerabilities are a challenge in cybersecurity. Manual security patches are often difficult and slow to be deployed, while new vulnerabilities are created. Binary code vulnerability detection is less studied and more complex compared to source code, and this has important practical implications. Deep learning has become an efficient and powerful tool in the security domain, where it provides end-to-end and accurate prediction. Modern deep learning approaches learn the program semantics through sequence and graph neural networks, using various intermediate representation of programs, such as abstract syntax trees (AST) or control flow graphs (CFG). Due to the complex nature of program execution, the output of an execution depends on the many program states and inputs. Also, a CFG generated from static analysis can be an overestimation of the true program flow. Moreover, the size of programs often does not allow a graph neural network with fixed layers to aggregate global information. To address these issues, we propose DeepEXE, an agent-based implicit neural network that mimics the execution path of a program. We use reinforcement learning to enhance the branching decision at every program state transition and create a dynamic environment to learn the dependency between a vulnerability and certain program states. An implicitly defined neural network enables nearly infinite state transitions until convergence, which captures the structural information at a higher level. The experiments are conducted on two semi-synthetic and two real-world datasets. We show that DeepEXE is an accurate and efficient method and outperforms the state-of-the-art vulnerability detection methods.

AISep 27, 2025
Beyond Embeddings: Interpretable Feature Extraction for Binary Code Similarity

Charles E. Gagnon, Steven H. H. Ding, Philippe Charland et al.

Binary code similarity detection is a core task in reverse engineering. It supports malware analysis and vulnerability discovery by identifying semantically similar code in different contexts. Modern methods have progressed from manually engineered features to vector representations. Hand-crafted statistics (e.g., operation ratios) are interpretable, but shallow and fail to generalize. Embedding-based methods overcome this by learning robust cross-setting representations, but these representations are opaque vectors that prevent rapid verification. They also face a scalability-accuracy trade-off, since high-dimensional nearest-neighbor search requires approximations that reduce precision. Current approaches thus force a compromise between interpretability, generalizability, and scalability. We bridge these gaps using a language model-based agent to conduct structured reasoning analysis of assembly code and generate features such as input/output types, side effects, notable constants, and algorithmic intent. Unlike hand-crafted features, they are richer and adaptive. Unlike embeddings, they are human-readable, maintainable, and directly searchable with inverted or relational indexes. Without any matching training, our method respectively achieves 42% and 62% for recall@1 in cross-architecture and cross-optimization tasks, comparable to embedding methods with training (39% and 34%). Combined with embeddings, it significantly outperforms the state-of-the-art, demonstrating that accuracy, scalability, and interpretability can coexist.

LGSep 15, 2019
I-MAD: Interpretable Malware Detector Using Galaxy Transformer

Miles Q. Li, Benjamin C. M. Fung, Philippe Charland et al.

Malware currently presents a number of serious threats to computer users. Signature-based malware detection methods are limited in detecting new malware samples that are significantly different from known ones. Therefore, machine learning-based methods have been proposed, but there are two challenges these methods face. The first is to model the full semantics behind the assembly code of malware. The second challenge is to provide interpretable results while keeping excellent detection performance. In this paper, we propose an Interpretable MAlware Detector (I-MAD) that outperforms state-of-the-art static malware detection models regarding accuracy with excellent interpretability. To improve the detection performance, I-MAD incorporates a novel network component called the Galaxy Transformer network that can understand assembly code at the basic block, function, and executable levels. It also incorporates our proposed interpretable feed-forward neural network to provide interpretations for its detection results by quantifying the impact of each feature with respect to the prediction. Experiment results show that our model significantly outperforms existing state-of-the-art static malware detection models and presents meaningful interpretations.

CRJul 20, 2019
ER-AE: Differentially Private Text Generation for Authorship Anonymization

Haohan Bo, Steven H. H. Ding, Benjamin C. M. Fung et al.

Most of privacy protection studies for textual data focus on removing explicit sensitive identifiers. However, personal writing style, as a strong indicator of the authorship, is often neglected. Recent studies, such as SynTF, have shown promising results on privacy-preserving text mining. However, their anonymization algorithm can only output numeric term vectors which are difficult for the recipients to interpret. We propose a novel text generation model with a two-set exponential mechanism for authorship anonymization. By augmenting the semantic information through a REINFORCE training reward function, the model can generate differentially private text that has a close semantic and similar grammatical structure to the original text while removing personal traits of the writing style. It does not assume any conditioned labels or paralleled text data for training. We evaluate the performance of the proposed model on the real-life peer reviews dataset and the Yelp review dataset. The result suggests that our model outperforms the state-of-the-art on semantic preservation, authorship obfuscation, and stylometric transformation.

CLJun 3, 2016
Learning Stylometric Representations for Authorship Analysis

Steven H. H. Ding, Benjamin C. M. Fung, Farkhund Iqbal et al.

Authorship analysis (AA) is the study of unveiling the hidden properties of authors from a body of exponentially exploding textual data. It extracts an author's identity and sociolinguistic characteristics based on the reflected writing styles in the text. It is an essential process for various areas, such as cybercrime investigation, psycholinguistics, political socialization, etc. However, most of the previous techniques critically depend on the manual feature engineering process. Consequently, the choice of feature set has been shown to be scenario- or dataset-dependent. In this paper, to mimic the human sentence composition process using a neural network approach, we propose to incorporate different categories of linguistic features into distributed representation of words in order to learn simultaneously the writing style representations based on unlabeled texts for authorship analysis. In particular, the proposed models allow topical, lexical, syntactical, and character-level feature vectors of each document to be extracted as stylometrics. We evaluate the performance of our approach on the problems of authorship characterization and authorship verification with the Twitter, novel, and essay datasets. The experiments suggest that our proposed text representation outperforms the bag-of-lexical-n-grams, Latent Dirichlet Allocation, Latent Semantic Analysis, PVDM, PVDBOW, and word2vec representations.