Cody Fleming

LG
h-index: 16
Papers: 18
Citations: 170
Novelty: 57%
AI Score: 56

18 Papers

SY May 15
Functional requirements decomposition in set-based design

Minghui Sun, Zhaoyang Chen, Georgios Bakirtzis et al.

Designing systems is typically uncertain and ambiguous at early stages. Set-based design supports alternative exploration and gradual uncertainty reduction during the early lifecycle, making it practical for complex systems design. In parallel, functional requirements decomposition helps to advance the design incrementally. However, current literature on set-based design lacks formal guidance on how to decompose functional requirements. To bridge this gap, we introduce a four-step method to decompose functional requirements for set-based design hierarchically. We systematically define, reason about, and narrow the sets, breaking down the functional requirements into formal sub-requirements. This method allows parallel abstraction, ensuring the resulting system satisfies the top-level functional requirements.

LG May 18
COOPO: Cyclic Offline-Online Policy Optimization Algorithm

Qisai Liu, Zhanhong Jiang, Joshua Russell Waite et al.

Offline reinforcement learning struggles with distributional shift and constrained performance due to static dataset limitations, while online RL demands prohibitive environment interactions. The recent advent of hybrid offline-to-online methods bridges these domains but suffers from distribution drift during transitions and catastrophic forgetting of offline knowledge. We introduce COOPO (Cyclic Offline-Online Policy Optimization), a generalized framework that repeatedly cycles between constrained offline training and online fine-tuning. Each cycle first anchors the policy to the dataset via KL-regularized advantage-weighted offline updates to minimize distributional shift and then fine-tunes it online using any policy optimization method for stable exploration. Crucially, periodically returning to offline training eliminates forgetting and drift while maximizing dataset reuse. The cyclic behavior also helps reduce the online environment interactions. Theoretically, COOPO achieves better online sample efficiency, surpassing pure online RL, with guaranteed monotonic improvement under standard coverage assumptions. Extensive D4RL benchmarks demonstrate that COOPO reduces online interactions versus state-of-the-art hybrids while improving final returns, maintaining robustness across diverse offline algorithms and online optimizers. This looped synergy sets new efficiency and performance standards for adaptive RL.

LG May 11
Learning When to Act: Communication-Efficient Reinforcement Learning via Run-Time Assurance

Adam Haroon, Erick J. Rodríguez-Seda, Cody Fleming et al.

Safe reinforcement learning (RL) typically asks $\textit{what}$ an agent should do. We ask $\textit{when}$ it needs to act, and show that a single policy can jointly learn control inputs and communication-efficient timing decisions under a pointwise Lyapunov safety shield. We focus on stabilization around a known equilibrium, where CARE-based LQR backups, Lyapunov certificates, and classical Lyapunov-STC are well defined, enabling clean comparison against analytical baselines. A run-time assurance (RTA) layer overrides the policy via a one-step-ahead Lyapunov prediction and a precomputed LQR backup, providing a strictly stronger guarantee than constrained MDP methods that enforce safety only in expectation. On an inverted pendulum, cart--pole, and planar quadrotor, the learned policy achieves $1.91\times$, $1.45\times$, and $3.51\times$ higher mean inter-sample interval (MSI) than a Lyapunov-triggered baseline; a fixed LQR controller at the same average rate is unstable on all three plants, showing that adaptive timing, not a lower average rate, makes sparsity safe. A CARE-derived Lyapunov reward transfers across environments without redesign, with a single weight $w_c$ controlling the stability--communication tradeoff; ablations confirm the RTA shield is essential, with its removal reducing MSI by $1.27$--$1.84\times$ and degrading state norms. A preference-conditioned extension recovers the full tradeoff frontier from one model at $\tfrac{2}{11}$ of training compute, and SAC experiments show the results are algorithm-agnostic across discrete and continuous domains. A 12-state 3D quadrotor case study extends the framework to higher-dimensional systems where classical STC is intractable, and robustness to $\pm30\%$ mass variation and disturbances shows graceful degradation, with the RTA absorbing what the learned policy cannot.
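The run-time assurance idea in this abstract (override the policy when a one-step-ahead Lyapunov prediction fails, and fall back to an LQR backup) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the plant is a discrete-time double integrator, the backup comes from a discrete Riccati iteration standing in for the CARE-based design the abstract mentions, and the acceptance test "V must not increase" is one plausible form of the one-step check. The names `dlqr` and `rta_shield` are hypothetical.

```python
import numpy as np

def dlqr(A, B, Q, R, iters=500):
    """Iterate the discrete Riccati recursion; return cost matrix P and gain K."""
    P = Q.copy()
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return P, K

def rta_shield(x, u_policy, A, B, P, K):
    """Keep u_policy if the predicted V(x) = x^T P x does not increase,
    otherwise override with the LQR backup u = -K x.
    Returns (action, overridden)."""
    v_now = x @ P @ x
    x_next = A @ x + B @ u_policy          # one-step-ahead prediction
    if x_next @ P @ x_next <= v_now:
        return u_policy, False
    return -K @ x, True

# Assumed plant: double integrator sampled at dt = 0.1 (not from the paper).
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.5 * dt**2], [dt]])
Q = np.eye(2)
R = np.array([[1.0]])
P, K = dlqr(A, B, Q, R)
```

With this setup, a large destabilizing input is replaced by the backup, while the backup action itself always passes the check because the LQR closed loop decreases V at every step.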

LG Feb 19
LexiSafe: Offline Safe Reinforcement Learning with Lexicographic Safety-Reward Hierarchy

Hsin-Jung Yang, Zhanhong Jiang, Prajwal Koirala et al.

Offline safe reinforcement learning (RL) is increasingly important for cyber-physical systems (CPS), where safety violations during training are unacceptable and only pre-collected data are available. Existing offline safe RL methods typically balance reward-safety tradeoffs through constraint relaxation or joint optimization, but they often lack structural mechanisms to prevent safety drift. We propose LexiSafe, a lexicographic offline RL framework designed to preserve safety-aligned behavior. We first develop LexiSafe-SC, a single-cost formulation for standard offline safe RL, and derive safety-violation and performance-suboptimality bounds that together yield sample-complexity guarantees. We then extend the framework to hierarchical safety requirements with LexiSafe-MC, which supports multiple safety costs and admits its own sample-complexity analysis. Empirically, LexiSafe demonstrates reduced safety violations and improved task performance compared to constrained offline baselines. By unifying lexicographic prioritization with structural bias, LexiSafe offers a practical and theoretically grounded approach for safety-critical CPS decision-making.

LG Feb 5
Toward Faithful and Complete Answer Construction from a Single Document

Zhaoyang Chen, Cody Fleming

Modern large language models (LLMs) are powerful generators driven by statistical next-token prediction. While effective at producing fluent text, this design biases models toward high-probability continuations rather than exhaustive and faithful answers grounded in source content. As a result, directly applying LLMs lacks systematic mechanisms to ensure both completeness (avoiding omissions) and faithfulness (avoiding unsupported content), which fundamentally conflicts with core AI safety principles. To address this limitation, we present EVE, a structured framework for document-grounded reasoning. Unlike free-form prompting, EVE constrains generation to a structured, verifiable pipeline that decomposes high-rigor reasoning into extraction, validation, and enumeration. Empirically, this design enables consistent and simultaneous improvements in recall, precision, and F1-score: recall and precision increase by up to 24% and 29%, respectively, with a corresponding 31% gain in F1-score. This effectively breaks the long-standing trade-off between coverage and accuracy typical of single-pass LLM generation, while also mitigating generation truncation caused by length limitations. At the same time, we emphasize that EVE exhibits performance saturation due to the inherent ambiguity of natural language, reflecting fundamental limits of language-based reasoning.
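For readers who want the metrics made concrete, the following sketch computes precision (faithfulness), recall (completeness), and F1 over sets of answer items. The abstract does not say how extracted items are matched against the reference, so exact set membership is an assumption here, and `prf1` is a hypothetical helper name.

```python
def prf1(predicted, reference):
    """Precision, recall, and F1 for two collections of answer items.
    Items are compared by exact equality (an assumption for this sketch)."""
    predicted, reference = set(predicted), set(reference)
    true_pos = len(predicted & reference)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Two of four predicted items are supported; two of three reference items are found.
p, r, f = prf1({"a", "b", "c", "d"}, {"a", "b", "e"})
```

In this toy case precision is 1/2, recall is 2/3, and F1 is 4/7, which shows how an answer can trade unsupported content against omissions.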

LG Jun 26, 2025
Flow-Based Single-Step Completion for Efficient and Expressive Policy Learning

Prajwal Koirala, Cody Fleming

Generative models such as diffusion and flow-matching offer expressive policies for offline reinforcement learning (RL) by capturing rich, multimodal action distributions, but their iterative sampling introduces high inference costs and training instability due to gradient propagation across sampling steps. We propose the Single-Step Completion Policy (SSCP), a generative policy trained with an augmented flow-matching objective to predict direct completion vectors from intermediate flow samples, enabling accurate, one-shot action generation. In an off-policy actor-critic framework, SSCP combines the expressiveness of generative models with the training and inference efficiency of unimodal policies, without requiring long backpropagation chains. Our method scales effectively to offline, offline-to-online, and online RL settings, offering substantial gains in speed and adaptability over diffusion-based baselines. We further extend SSCP to goal-conditioned RL, enabling flat policies to exploit subgoal structures without explicit hierarchical inference. SSCP achieves strong results across standard offline RL and behavior cloning benchmarks, positioning it as a versatile, expressive, and efficient framework for deep RL and sequential decision-making.

LG Dec 11, 2024
Latent Safety-Constrained Policy Approach for Safe Offline Reinforcement Learning

Prajwal Koirala, Zhanhong Jiang, Soumik Sarkar et al.

In safe offline reinforcement learning (RL), the objective is to develop a policy that maximizes cumulative rewards while strictly adhering to safety constraints, utilizing only offline data. Traditional methods often face difficulties in balancing these constraints, leading to either diminished performance or increased safety risks. We address these issues with a novel approach that begins by learning a conservatively safe policy through the use of Conditional Variational Autoencoders, which model the latent safety constraints. Subsequently, we frame this as a Constrained Reward-Return Maximization problem, wherein the policy aims to optimize rewards while complying with the inferred latent safety constraints. This is achieved by training an encoder with a reward-Advantage Weighted Regression objective within the latent constraint space. Our methodology is supported by theoretical analysis, including bounds on policy performance and sample complexity. Extensive empirical evaluation on benchmark datasets, including challenging autonomous driving scenarios, demonstrates that our approach not only maintains safety compliance but also excels in cumulative reward optimization, surpassing existing methods. Additional visualizations provide further insights into the effectiveness and underlying mechanisms of our approach.

LG Dec 12, 2024
FAWAC: Feasibility Informed Advantage Weighted Regression for Persistent Safety in Offline Reinforcement Learning

Prajwal Koirala, Zhanhong Jiang, Soumik Sarkar et al.

Safe offline reinforcement learning aims to learn policies that maximize cumulative rewards while adhering to safety constraints, using only offline data for training. A key challenge is balancing safety and performance, particularly when the policy encounters out-of-distribution (OOD) states and actions, which can lead to safety violations or overly conservative behavior during deployment. To address these challenges, we introduce Feasibility Informed Advantage Weighted Actor-Critic (FAWAC), a method that prioritizes persistent safety in constrained Markov decision processes (CMDPs). FAWAC formulates policy optimization with feasibility conditions derived specifically for offline datasets, enabling safe policy updates in non-parametric policy space, followed by projection into parametric space for constrained actor training. By incorporating a cost-advantage term into Advantage Weighted Regression (AWR), FAWAC ensures that the safety constraints are respected while maximizing performance. Additionally, we propose a strategy to address a more challenging class of problems that involves tempting datasets where trajectories are predominantly high-rewarded but unsafe. Empirical evaluations on standard benchmarks demonstrate that FAWAC achieves strong results, effectively balancing safety and performance in learning policies from the static datasets.
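Advantage Weighted Regression weights each dataset action by the exponential of its advantage divided by a temperature. The sketch below shows one plausible way a cost-advantage term could enter those weights: subtracting a scaled cost advantage before exponentiating. This exact form, the multiplier `lam`, the temperature `beta`, and the weight clip are assumptions for illustration, not the formulation from the paper.

```python
import numpy as np

def cost_aware_awr_weights(adv_reward, adv_cost, beta=1.0, lam=1.0, max_weight=20.0):
    """Regression weights exp((A_r - lam * A_c) / beta), clipped for stability.
    adv_reward: reward advantages of dataset actions.
    adv_cost:   cost advantages of the same actions (higher means less safe)."""
    score = (np.asarray(adv_reward, dtype=float)
             - lam * np.asarray(adv_cost, dtype=float)) / beta
    return np.minimum(np.exp(score), max_weight)

# Two actions with equal reward advantage; the second has a higher cost advantage.
weights = cost_aware_awr_weights([1.0, 1.0], [0.0, 2.0])
```

Under this assumed form, the second action receives a much smaller weight, so the regression step imitates it less even though its reward advantage is the same.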

LG Dec 13, 2025
Neural CDEs as Correctors for Learned Time Series Models

Muhammad Bilal Shahid, Prajwal Koirala, Cody Fleming

Learned time-series models, whether continuous- or discrete-time, are widely used to forecast the states of a dynamical system. Such models generate multi-step forecasts either directly, by predicting the full horizon at once, or iteratively, by feeding back their own predictions at each step. In both cases, the multi-step forecasts are prone to errors. To address this, we propose a Predictor-Corrector mechanism where the Predictor is any learned time-series model and the Corrector is a neural controlled differential equation. The Predictor forecasts, and the Corrector predicts the errors of the forecasts. Adding these errors to the forecasts improves forecast performance. The proposed Corrector works with irregularly sampled time series and with both continuous- and discrete-time Predictors. Additionally, we introduce two regularization strategies to improve the extrapolation performance of the Corrector with accelerated training. We evaluate our Corrector with diverse Predictors, e.g., neural ordinary differential equations, Contiformer, and DLinear, on synthetic, physics simulation, and real-world forecasting datasets. The experiments demonstrate that the Predictor-Corrector mechanism consistently improves performance compared to the Predictor alone.
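The Predictor-Corrector structure (forecast, predict the forecast error, add the two) can be sketched with deliberately simple stand-ins. Here the Predictor is a persistence forecast and the Corrector is a linear least-squares model fitted on the Predictor's residuals; the paper's Corrector is a neural controlled differential equation, which this sketch does not implement. All function names are hypothetical.

```python
import numpy as np

def persistence_predictor(history):
    """Stand-in Predictor: forecast the next value as the last observed value."""
    return history[-1]

def fit_linear_corrector(features, residuals):
    """Stand-in Corrector: least-squares map from a feature to the Predictor's error."""
    X = np.column_stack([features, np.ones(len(features))])
    coef, *_ = np.linalg.lstsq(X, residuals, rcond=None)
    return coef

def corrected_forecast(history, coef):
    """Predictor forecast plus the Corrector's predicted error."""
    base = persistence_predictor(history)
    feats = np.array([history[-1] - history[-2], 1.0])
    return base + feats @ coef

# Toy series y[t] = t^2 for t = 0..9.
series = np.arange(10.0) ** 2
feats = series[1:-1] - series[:-2]   # last observed increment, for targets t = 2..9
resid = series[2:] - series[1:-1]    # persistence error on those targets
coef = fit_linear_corrector(feats, resid)
```

On this toy series the persistence forecast for the next step is 81, while the corrected forecast recovers the true value of 100, because the residual is an exact linear function of the last increment.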

LG May 19, 2025
Modular Diffusion Policy Training: Decoupling and Recombining Guidance and Diffusion for Offline RL

Zhaoyang Chen, Cody Fleming

Classifier-free guidance has shown strong potential in diffusion-based reinforcement learning. However, existing methods rely on joint training of the guidance module and the diffusion model, which can be suboptimal during the early stages when the guidance is inaccurate and provides noisy learning signals. In offline RL, guidance depends solely on offline data (observations, actions, and rewards) and is independent of the policy module's behavior, suggesting that joint training is not required. This paper proposes modular training methods that decouple the guidance module from the diffusion model, based on three key findings. (1) Guidance Necessity: We explore how the effectiveness of guidance varies with the training stage and algorithm choice, uncovering the roles of guidance and diffusion. A lack of good guidance in the early stage presents an opportunity for optimization. (2) Guidance-First Diffusion Training: We introduce a method where the guidance module is first trained independently as a value estimator, then frozen to guide the diffusion model using classifier-free reward guidance. This modularization reduces memory usage, improves computational efficiency, and enhances both sample efficiency and final performance. (3) Cross-Module Transferability: Applying two independently trained guidance models, one during training and the other during inference, can significantly reduce normalized score variance (e.g., reducing IQR by 86%). We show that guidance modules trained with one algorithm (e.g., IDQL) can be directly reused with another (e.g., DQL), with no additional training required, demonstrating baseline-level performance as well as strong modularity and transferability. We provide theoretical justification and empirical validation on bullet D4RL benchmarks. Our findings suggest a new paradigm for offline RL: modular, reusable, and composable training pipelines.
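As background for the guidance step, the sketch below shows one common parametrization of classifier-free guidance: the guided noise estimate extrapolates from the unconditional prediction toward the conditional one by a guidance scale. This is the generic combination rule, not the paper's training procedure; the function name and the use of plain arrays in place of network outputs are assumptions.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: eps_uncond + scale * (eps_cond - eps_uncond).
    scale = 0 gives the unconditional estimate, scale = 1 the conditional one,
    and scale > 1 pushes further in the conditional direction."""
    eps_uncond = np.asarray(eps_uncond, dtype=float)
    eps_cond = np.asarray(eps_cond, dtype=float)
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Placeholder noise predictions standing in for two forward passes of a denoiser.
eps_u = np.array([0.0, 1.0])
eps_c = np.array([1.0, 3.0])
guided = cfg_combine(eps_u, eps_c, scale=2.0)
```

Because the combination only needs the two predictions at sampling time, the module that supplies the conditioning signal can in principle be trained separately and swapped, which is the kind of modularity the abstract argues for.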

LG Feb 11, 2024
Towards Robust Car Following Dynamics Modeling via Blackbox Models: Methodology, Analysis, and Recommendations

Muhammad Bilal Shahid, Cody Fleming

The selection of the target variable is important when learning the parameters of classical car-following models such as GIPPS and IDM. There is a vast body of literature on which target variable is optimal for classical car-following models, but no study empirically evaluates the selection of optimal target variables for black-box models such as LSTM. Black-box models, like LSTM and Gaussian Process (GP), are increasingly being used to model car-following behavior without careful selection of target variables. The current work tests different target variables, like acceleration, velocity, and headway, for three black-box models, i.e., GP, LSTM, and Kernel Ridge Regression. These models have different objective functions and work in different vector spaces, e.g., GP works in function space, and LSTM works in parameter space. The experiments show that the optimal target variable recommendations for black-box models differ from those for classical car-following models, depending on the objective function and the vector space. It is worth mentioning that the models and datasets used during evaluation are diverse in nature: the datasets contain both automated and human-driven vehicle trajectories; the black-box models belong to both parametric and non-parametric classes of models. This diversity is important during the analysis of variance, wherein we try to find the interaction between datasets, models, and target variables. It is shown that the models and target variables interact and that the recommended target variables don't depend on the dataset under consideration.

LG Jan 21, 2024
Solving Offline Reinforcement Learning with Decision Tree Regression

Prajwal Koirala, Cody Fleming

This study presents a novel approach to addressing offline reinforcement learning (RL) problems by reframing them as regression tasks that can be effectively solved using Decision Trees. Mainly, we introduce two distinct frameworks: return-conditioned and return-weighted decision tree policies (RCDTP and RWDTP), both of which achieve notable speed in agent training as well as inference, with training typically lasting less than a few minutes. Despite the simplification inherent in this reformulated approach to offline RL, our agents demonstrate performance that is at least on par with the established methods. We evaluate our methods on D4RL datasets for locomotion and manipulation, as well as other robotic tasks involving wheeled and flying robots. Additionally, we assess performance in delayed/sparse reward scenarios and highlight the explainability of these policies through action distribution and feature importance.

CV Jan 29, 2021
SCAN: A Spatial Context Attentive Network for Joint Multi-Agent Intent Prediction

Jasmine Sekhon, Cody Fleming

Safe navigation of autonomous agents in human-centric environments requires the ability to understand and predict the motion of neighboring pedestrians. However, predicting pedestrian intent is a complex problem. Pedestrian motion is governed by complex social navigation norms, is dependent on neighbors' trajectories, and is multimodal in nature. In this work, we propose SCAN, a Spatial Context Attentive Network that can jointly predict multiple socially acceptable future trajectories for all pedestrians in a scene. SCAN encodes the influence of spatially close neighbors using a novel spatial attention mechanism in a manner that relies on fewer assumptions, is parameter efficient, and is more interpretable compared to state-of-the-art spatial attention approaches. Through experiments on several datasets we demonstrate that our approach can also quantitatively outperform state-of-the-art trajectory prediction methods in terms of accuracy of predicted intent.

CR Nov 29, 2020
Cyberphysical Security Through Resiliency: A Systems-centric Approach

Cody Fleming, Carl Elks, Georgios Bakirtzis et al.

Cyber-physical systems (CPS) are often defended in the same manner as information technology (IT) systems -- by using perimeter security. Multiple factors make such defenses insufficient for CPS. Resiliency shows potential in overcoming these shortfalls. Techniques for achieving resilience exist; however, methods and theory for evaluating resilience in CPS are lacking. We argue that such methods and theory should assist stakeholders in deciding where and how to apply design patterns for resilience. Such a problem potentially involves tradeoffs between different objectives and criteria, and such decisions need to be driven by traceable, defensible, repeatable engineering evidence. Multi-criteria resiliency problems require a system-oriented approach that evaluates systems in the presence of threats as well as potential design solutions once vulnerabilities have been identified. We present a systems-oriented view of cyber-physical security, termed Mission Aware, that is based on a holistic understanding of mission goals, system dynamics, and risk.

RO Jun 16, 2020
ShieldNN: A Provably Safe NN Filter for Unsafe NN Controllers

James Ferlez, Mahmoud Elnaggar, Yasser Shoukry et al.

In this paper, we develop a novel closed-form Control Barrier Function (CBF) and associated controller shield for the Kinematic Bicycle Model (KBM) with respect to obstacle avoidance. The proposed CBF and shield -- designed by an algorithm we call ShieldNN -- provide two crucial advantages over existing methodologies. First, ShieldNN considers steering and velocity constraints directly with the non-affine KBM dynamics; this is in contrast to more general methods, which typically consider only affine dynamics and do not guarantee invariance properties under control constraints. Second, ShieldNN provides a closed-form set of safe controls for each state, unlike more general methods, which typically rely on optimization algorithms to generate a single instantaneous control for each state. Together, these advantages make ShieldNN uniquely suited to providing efficient multi-obstacle safe actions (i.e., multiple-barrier-function shielding) during the training of a Reinforcement Learning (RL) enabled Neural Network controller. We show via experiments that ShieldNN dramatically increases the completion rate of RL training episodes in the presence of multiple obstacles, thus establishing the value of ShieldNN in training RL-based controllers.

SE Feb 17, 2019
Towards Improved Testing For Deep Learning

Jasmine Sekhon, Cody Fleming

The growing use of deep neural networks in safety-critical applications makes it necessary to carry out adequate testing to detect and correct any incorrect behavior on corner-case inputs before the networks are actually used. Deep neural networks lack an explicit control-flow structure, making it impossible to apply traditional software testing criteria such as code coverage to them. In this paper, we examine existing testing methods for deep neural networks, the opportunities for improvement, and the need for a fast, scalable, generalizable end-to-end testing method. We also propose a coverage criterion for deep neural networks that tries to capture all possible parts of the deep neural network's logic.

SY Sep 18, 2018
PAIM: Platoon-based Autonomous Intersection Management

Masoud Bashiri, Hassan Jafarzadeh, Cody Fleming

With the emergence of autonomous ground vehicles and the recent advancements in Intelligent Transportation Systems, Autonomous Traffic Management has garnered more and more attention. Autonomous Intersection Management (AIM), also known as Cooperative Intersection Management (CIM), is among the more challenging traffic problems and poses important questions related to safety and optimization in terms of delays, fuel consumption, emissions and reliability. Previously we introduced two stop-sign based policies for autonomous intersection management that were compatible with platoons of autonomous vehicles. These policies outperformed the regular stop-sign policy both in terms of average delay per vehicle and variance in delay. This paper introduces a reservation-based policy that utilizes the cost functions from our previous work to derive optimal schedules for platoons of vehicles. The proposed policy guarantees safety by not allowing vehicles with conflicting turning movements to be in the conflict zone at the same time. Moreover, a greedy algorithm is designed to search through all possible schedules and pick the one that minimizes a cost function based on a trade-off between total delay and variance in delay. A simulator is designed to compare the results of the proposed policy, in terms of average delay per vehicle and variance in delay, with those of a 4-phase traffic light.
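The scheduling objective described here, a trade-off between total delay and variance in delay, can be illustrated with a small sketch. It makes strong simplifying assumptions that are not from the paper: every platoon conflicts with every other (so the conflict zone serves one platoon at a time), delay is counted per platoon rather than per vehicle, and the search is a plain exhaustive enumeration of orderings rather than the authors' greedy algorithm. All names and the weight `w` are hypothetical.

```python
from itertools import permutations
from statistics import pvariance

def schedule_delays(order, arrival, duration):
    """Delay of each platoon when the conflict zone serves them in the given order."""
    clock = 0.0
    delays = {}
    for platoon in order:
        start = max(clock, arrival[platoon])   # cannot enter before arriving
        delays[platoon] = start - arrival[platoon]
        clock = start + duration[platoon]      # zone is busy until the platoon clears
    return delays

def schedule_cost(order, arrival, duration, w=0.5):
    """Weighted trade-off between total delay and variance in delay."""
    d = list(schedule_delays(order, arrival, duration).values())
    return w * sum(d) + (1 - w) * pvariance(d)

def best_schedule(arrival, duration, w=0.5):
    """Enumerate every ordering and return the one with the lowest cost."""
    return min(permutations(arrival),
               key=lambda order: schedule_cost(order, arrival, duration, w))

# Arrival times and crossing durations in seconds (made-up values).
arrival = {"A": 0.0, "B": 1.0, "C": 2.0}
duration = {"A": 5.0, "B": 1.0, "C": 1.0}
order = best_schedule(arrival, duration)
```

For these made-up values the lowest-cost order lets the two short platoons B and C cross before the long platoon A, even though A arrived first. Exhaustive enumeration grows factorially with the number of platoons, so it is only practical for small examples like this one.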

CR Nov 2, 2017
A Systems Approach for Eliciting Mission-Centric Security Requirements

Bryan Carter, Georgios Bakirtzis, Carl Elks et al.

The security of cyber-physical systems is first and foremost a safety problem, yet it is typically handled as a traditional security problem, which means that solutions are based on defending against threats and are often implemented too late. This approach neglects to take into consideration the context in which the system is intended to operate, thus system safety may be compromised. This paper presents a systems-theoretic analysis approach that combines stakeholder perspectives with a modified version of Systems-Theoretic Accident Model and Process (STAMP) that allows decision-makers to strategically enhance the safety, resilience, and security of a cyber-physical system against potential threats. This methodology allows the capture of vital mission-specific information in a model, which then allows analysts to identify and mitigate vulnerabilities in the locations most critical to mission success. We present an overview of the general approach followed by a real example using an unmanned aerial vehicle conducting a reconnaissance mission.