James Cheney

CR
13papers
270citations
Novelty40%
AI Score40

13 Papers

LGFeb 2
Refining Decision Boundaries In Anomaly Detection Using Similarity Search Within the Feature Space

Sidahmed Benabderrahmane, Petko Valtchev, James Cheney et al.

Detecting rare and diverse anomalies in highly imbalanced datasets-such as Advanced Persistent Threats (APTs) in cybersecurity-remains a fundamental challenge for machine learning systems. Active learning offers a promising direction by strategically querying an oracle to minimize labeling effort, yet conventional approaches often fail to exploit the intrinsic geometric structure of the feature space for model refinement. In this paper, we introduce SDA2E, a Sparse Dual Adversarial Attention-based AutoEncoder designed to learn compact and discriminative latent representations from imbalanced, high-dimensional data. We further propose a similarity-guided active learning framework that integrates three novel strategies to refine decision boundaries efficiently: mormal-like expansion, which enriches the training set with points similar to labeled normals to improve reconstruction fidelity; anomaly-like prioritization, which boosts ranking accuracy by focusing on points resembling known anomalies; and a hybrid strategy that combines both for balanced model refinement and ranking. A key component of our framework is a new similarity measure, Normalized Matching 1s (SIM_NM1), tailored for sparse binary embeddings. We evaluate SDA2E extensively across 52 imbalanced datasets, including multiple DARPA Transparent Computing scenarios, and benchmark it against 15 state-of-the-art anomaly detection methods. Results demonstrate that SDA2E consistently achieves superior ranking performance (nDCG up to 1.0 in several cases) while reducing the required labeled data by up to 80% compared to passive training. Statistical tests confirm the significance of these improvements. Our work establishes a robust, efficient, and statistically validated framework for anomaly detection that is particularly suited to cybersecurity applications such as APT detection.

LGNov 25, 2025
Ranking-Enhanced Anomaly Detection Using Active Learning-Assisted Attention Adversarial Dual AutoEncoders

Sidahmed Benabderrahmane, James Cheney, Talal Rahwan

Advanced Persistent Threats (APTs) pose a significant challenge in cybersecurity due to their stealthy and long-term nature. Modern supervised learning methods require extensive labeled data, which is often scarce in real-world cybersecurity environments. In this paper, we propose an innovative approach that leverages AutoEncoders for unsupervised anomaly detection, augmented by active learning to iteratively improve the detection of APT anomalies. By selectively querying an oracle for labels on uncertain or ambiguous samples, we minimize labeling costs while improving detection rates, enabling the model to improve its detection accuracy with minimal data while reducing the need for extensive manual labeling. We provide a detailed formulation of the proposed Attention Adversarial Dual AutoEncoder-based anomaly detection framework and show how the active learning loop iteratively enhances the model. The framework is evaluated on real-world imbalanced provenance trace databases produced by the DARPA Transparent Computing program, where APT-like attacks constitute as little as 0.004\% of the data. The datasets span multiple operating systems, including Android, Linux, BSD, and Windows, and cover two attack scenarios. The results have shown significant improvements in detection rates during active learning and better performance compared to other existing approaches.

CRJun 27, 2024
Hack Me If You Can: Aggregating AutoEncoders for Countering Persistent Access Threats Within Highly Imbalanced Data

Sidahmed Benabderrahmane, Ngoc Hoang, Petko Valtchev et al.

Advanced Persistent Threats (APTs) are sophisticated, targeted cyberattacks designed to gain unauthorized access to systems and remain undetected for extended periods. To evade detection, APT cyberattacks deceive defense layers with breaches and exploits, thereby complicating exposure by traditional anomaly detection-based security methods. The challenge of detecting APTs with machine learning is compounded by the rarity of relevant datasets and the significant imbalance in the data, which makes the detection process highly burdensome. We present AE-APT, a deep learning-based tool for APT detection that features a family of AutoEncoder methods ranging from a basic one to a Transformer-based one. We evaluated our tool on a suite of provenance trace databases produced by the DARPA Transparent Computing program, where APT-like attacks constitute as little as 0.004% of the data. The datasets span multiple operating systems, including Android, Linux, BSD, and Windows, and cover two attack scenarios. The outcomes showed that AE-APT has significantly higher detection rates compared to its competitors, indicating superior performance in detecting and ranking anomalies.

CRMay 20, 2021
A Rule Mining-Based Advanced Persistent Threats Detection System

Sidahmed Benabderrahmane, Ghita Berrada, James Cheney et al.

Advanced persistent threats (APT) are stealthy cyber-attacks that are aimed at stealing valuable information from target organizations and tend to extend in time. Blocking all APTs is impossible, security experts caution, hence the importance of research on early detection and damage limitation. Whole-system provenance-tracking and provenance trace mining are considered promising as they can help find causal relationships between activities and flag suspicious event sequences as they occur. We introduce an unsupervised method that exploits OS-independent features reflecting process activity to detect realistic APT-like attacks from provenance traces. Anomalous processes are ranked using both frequent and rare event associations learned from traces. Results are then presented as implications which, since interpretable, help leverage causality in explaining the detected anomalies. When evaluated on Transparent Computing program datasets (DARPA), our method outperformed competing approaches.

SEDec 3, 2020
VICToRy: Visual Interactive Consistency Management in Tolerant Rule-based Systems

Nils Weidmann, Anthony Anjorin, James Cheney

In the field of Model-Driven Engineering, there exist numerous tools that support various consistency management operations including model transformation, synchronisation and consistency checking. The supported operations, however, typically run completely in the background with only input and output made visible to the user. We argue that this often reduces both understandability and controllability. As a step towards improving this situation, we present VICToRy, a debugger for model generation and transformation based on Triple Graph Grammars, a well-known rule-based approach to bidirectional transformation. In addition to a fine-grained, step-by-step, interactive visualisation, VICToRy enables the user to actively explore and choose between multiple valid rule applications thus improving control and understanding.

DBJun 14, 2020
Categorical anomaly detection in heterogeneous data using minimum description length clustering

James Cheney, Xavier Gombau, Ghita Berrada et al.

Fast and effective unsupervised anomaly detection algorithms have been proposed for categorical data based on the minimum description length (MDL) principle. However, they can be ineffective when detecting anomalies in heterogeneous datasets representing a mixture of different sources, such as security scenarios in which system and user processes have distinct behavior patterns. We propose a meta-algorithm for enhancing any MDL-based anomaly detection model to deal with heterogeneous data by fitting a mixture model to the data, via a variant of k-means clustering. Our experimental results show that using a discrete mixture model provides competitive performance relative to two previous anomaly detection algorithms, while mixtures of more sophisticated models yield further gains, on both synthetic datasets and realistic datasets from a security scenario.

CRSep 24, 2019
ProvMark: A Provenance Expressiveness Benchmarking System

Sheung Chi Chan, James Cheney, Pramod Bhatotia et al.

System level provenance is of widespread interest for applications such as security enforcement and information protection. However, testing the correctness or completeness of provenance capture tools is challenging and currently done manually. In some cases there is not even a clear consensus about what behavior is correct. We present an automated tool, ProvMark, that uses an existing provenance system as a black box and reliably identifies the provenance graph structure recorded for a given activity, by a reduction to subgraph isomorphism problems handled by an external solver. ProvMark is a beginning step in the much needed area of testing and comparing the expressiveness of provenance systems. We demonstrate ProvMark's usefuless in comparing three capture systems with different architectures and distinct design philosophies.

PLJul 20, 2019
Towards meta-interpretive learning of programming language semantics

Sándor Bartha, James Cheney

We introduce a new application for inductive logic programming: learning the semantics of programming languages from example evaluations. In this short paper, we explored a simplified task in this domain using the Metagol meta-interpretive learning system. We highlighted the challenging aspects of this scenario, including abstracting over function symbols, nonterminating examples, and learning non-observed predicates, and proposed extensions to Metagol helpful for overcoming these challenges, which may prove useful in other domains.

CRJun 17, 2019
A baseline for unsupervised advanced persistent threat detection in system-level provenance

Ghita Berrada, Sidahmed Benabderrahmane, James Cheney et al.

Advanced persistent threats (APT) are stealthy, sophisticated, and unpredictable cyberattacks that can steal intellectual property, damage critical infrastructure, or cause millions of dollars in damage. Detecting APTs by monitoring system-level activity is difficult because manually inspecting the high volume of normal system activity is overwhelming for security analysts. We evaluate the effectiveness of unsupervised batch and streaming anomaly detection algorithms over multiple gigabytes of provenance traces recorded on four different operating systems to determine whether they can detect realistic APT-like attacks reliably and efficiently. This report is the first detailed study of the effectiveness of generic unsupervised anomaly detection techniques in this setting.

PLMay 11, 2015
Notions of bidirectional computation and entangled state monads

Faris Abou-Saleh, James Cheney, Jeremy Gibbons et al.

Bidirectional transformations (bx) support principled consistency maintenance between data sources. Each data source corresponds to one perspective on a composite system, manifested by operations to 'get' and 'set' a view of the whole from that particular perspective. Bx are important in a wide range of settings, including databases, interactive applications, and model-driven development. We show that bx are naturally modelled in terms of mutable state; in particular, the 'set' operations are stateful functions. This leads naturally to considering bx that exploit other computational effects too, such as I/O, nondeterminism, and failure, all largely ignored in the bx literature to date. We present a semantic foundation for symmetric bidirectional transformations with effects. We build on the mature theory of monadic encapsulation of effects in functional programming, develop the equational theory and important combinators for effectful bx, and provide a prototype implementation in Haskell along with several illustrative examples.

SEFeb 9, 2015
YesWorkflow: A User-Oriented, Language-Independent Tool for Recovering Workflow Information from Scripts

Timothy McPhillips, Tianhong Song, Tyler Kolisnik et al.

Scientific workflow management systems offer features for composing complex computational pipelines from modular building blocks, for executing the resulting automated workflows, and for recording the provenance of data products resulting from workflow runs. Despite the advantages such features provide, many automated workflows continue to be implemented and executed outside of scientific workflow systems due to the convenience and familiarity of scripting languages (such as Perl, Python, R, and MATLAB), and to the high productivity many scientists experience when using these languages. YesWorkflow is a set of software tools that aim to provide such users of scripting languages with many of the benefits of scientific workflow systems. YesWorkflow requires neither the use of a workflow engine nor the overhead of adapting code to run effectively in such a system. Instead, YesWorkflow enables scientists to annotate existing scripts with special comments that reveal the computational modules and dataflows otherwise implicit in these scripts. YesWorkflow tools extract and analyze these comments, represent the scripts in terms of entities based on the typical scientific workflow model, and provide graphical renderings of this workflow-like view of the scripts. Future versions of YesWorkflow also will allow the prospective provenance of the data products of these scripts to be queried in ways similar to those available to users of scientific workflow systems.

DBMay 22, 2014
An Analytical Survey of Provenance Sanitization

James Cheney, Roly Perera

Security is likely becoming a critical factor in the future adoption of provenance technology, because of the risk of inadvertent disclosure of sensitive information. In this survey paper we review the state of the art in secure provenance, considering mechanisms for controlling access, and the extent to which these mechanisms preserve provenance integrity. We examine seven systems or approaches, comparing features and identifying areas for future work.

DBAug 2, 2013
Static Enforceability of XPath-Based Access Control Policies

James Cheney

We consider the problem of extending XML databases with fine-grained, high-level access control policies specified using XPath expressions. Most prior work checks individual updates dynamically, which is expensive (requiring worst-case execution time proportional to the size of the database). On the other hand, static enforcement can be performed without accessing the database but may be incomplete, in the sense that it may forbid accesses that dynamic enforcement would allow. We introduce topological characterizations of XPath fragments in order to study the problem of determining when an access control policy can be enforced statically without loss of precision. We introduce the notion of fair policies that are statically enforceable, and study the complexity of determining fairness and of static enforcement itself.