Shengbo Wang

LG
h-index78
28papers
252citations
Novelty62%
AI Score57

28 Papers

LGFeb 26, 2023
A Finite Sample Complexity Bound for Distributionally Robust Q-learning

Shengbo Wang, Nian Si, Jose Blanchet et al. · stanford

We consider a reinforcement learning setting in which the deployment environment is different from the training environment. Applying a robust Markov decision processes formulation, we extend the distributionally robust $Q$-learning framework studied in Liu et al. [2022]. Further, we improve the design and analysis of their multi-level Monte Carlo estimator. Assuming access to a simulator, we prove that the worst-case expected sample complexity of our algorithm to learn the optimal robust $Q$-function within an $ε$ error in the sup norm is upper bounded by $\tilde O(|S||A|(1-γ)^{-5}ε^{-2}p_{\wedge}^{-6}δ^{-4})$, where $γ$ is the discount rate, $p_{\wedge}$ is the non-zero minimal support probability of the transition kernels and $δ$ is the uncertainty size. This is the first sample complexity result for the model-free robust RL problem. Simulation studies further validate our theoretical results.

LGNov 15, 2023
On the Foundation of Distributionally Robust Reinforcement Learning

Shengbo Wang, Nian Si, Jose Blanchet et al. · stanford

Motivated by the need for a robust policy in the face of environment shifts between training and deployment, we contribute to the theoretical foundation of distributionally robust reinforcement learning (DRRL). This is accomplished through a comprehensive modeling framework centered around robust Markov decision processes (RMDPs). This framework obliges the decision maker to choose an optimal policy under the worst-case distributional shift orchestrated by an adversary. By unifying and extending existing formulations, we rigorously construct RMDPs that embrace various modeling attributes for both the decision maker and the adversary. These attributes include the structure of information availability-covering history-dependent, Markov, and Markov time-homogeneous dynamics-as well as constraints on the shifts induced by the adversary, with a focus on SA- and S-rectangularity. Within this RMDP framework, we investigate conditions for the existence or absence of the dynamic programming principle (DPP). From an algorithmic standpoint, the existence of DPP holds significant implications, as the vast majority of existing data and computationally efficient DRRL algorithms are reliant on the DPP. To investigate its existence, we systematically analyze various combinations of controller and adversary attributes, presenting streamlined proofs based on a unified methodology. We then construct counterexamples for settings where a fully general DPP fails to hold and establish asymptotically optimal history-dependent policies for key scenarios where the DPP is absent.

SPSep 16, 2023
Intelligent machines work in unstructured environments by differential neuromorphic computing

Shengbo Wang, Shuo Gao, Chenyu Tang et al.

Efficient operation of intelligent machines in the real world requires methods that allow them to understand and predict the uncertainties presented by the unstructured environments with good accuracy, scalability and generalization, similar to humans. Current methods rely on pretrained networks instead of continuously learning from the dynamic signal properties of working environments and suffer inherent limitations, such as data-hungry procedures, and limited generalization capabilities. Herein, we present a memristor-based differential neuromorphic computing, perceptual signal processing and learning method for intelligent machines. The main features of environmental information such as amplification (>720%) and adaptation (<50%) of mechanical stimuli encoded in memristors, are extracted to obtain human-like processing in unstructured environments. The developed method takes advantage of the intrinsic multi-state property of memristors and exhibits good scalability and generalization, as confirmed by validation in two different application scenarios: object grasping and autonomous driving. In the former, a robot hand experimentally realizes safe and stable grasping through fast learning (in ~1 ms) the unknown object features (e.g., sharp corner and smooth surface) with a single memristor. In the latter, the decision-making information of 10 unstructured environments in autonomous driving (e.g., overtaking cars, pedestrians) is accurately (94%) extracted with a 40*25 memristor array. By mimicking the intrinsic nature of human low-level perception mechanisms, the electronic memristive neuromorphic circuit-based method, presented here shows the potential for adapting to diverse sensing technologies and helping intelligent machines generate smart high-level decisions in the real world.

LGOct 13, 2023
Optimal Sample Complexity for Average Reward Markov Decision Processes

Shengbo Wang, Jose Blanchet, Peter Glynn · stanford

We resolve the open question regarding the sample complexity of policy learning for maximizing the long-run average reward associated with a uniformly ergodic Markov decision process (MDP), assuming a generative model. In this context, the existing literature provides a sample complexity upper bound of $\widetilde O(|S||A|t_{\text{mix}}^2 ε^{-2})$ and a lower bound of $Ω(|S||A|t_{\text{mix}} ε^{-2})$. In these expressions, $|S|$ and $|A|$ denote the cardinalities of the state and action spaces respectively, $t_{\text{mix}}$ serves as a uniform upper limit for the total variation mixing times, and $ε$ signifies the error tolerance. Therefore, a notable gap of $t_{\text{mix}}$ still remains to be bridged. Our primary contribution is the development of an estimator for the optimal policy of average reward MDPs with a sample complexity of $\widetilde O(|S||A|t_{\text{mix}}ε^{-2})$. This marks the first algorithm and analysis to reach the literature's lower bound. Our new algorithm draws inspiration from ideas in Li et al. (2020), Jin and Sidford (2021), and Wang et al. (2023). Additionally, we conduct numerical experiments to validate our theoretical findings.

SYJul 3, 2023
Model-Assisted Probabilistic Safe Adaptive Control With Meta-Bayesian Learning

Shengbo Wang, Ke Li, Yin Yang et al.

Breaking safety constraints in control systems can lead to potential risks, resulting in unexpected costs or catastrophic damage. Nevertheless, uncertainty is ubiquitous, even among similar tasks. In this paper, we develop a novel adaptive safe control framework that integrates meta learning, Bayesian models, and control barrier function (CBF) method. Specifically, with the help of CBF method, we learn the inherent and external uncertainties by a unified adaptive Bayesian linear regression (ABLR) model, which consists of a forward neural network (NN) and a Bayesian output layer. Meta learning techniques are leveraged to pre-train the NN weights and priors of the ABLR model using data collected from historical similar tasks. For a new control task, we refine the meta-learned models using a few samples, and introduce pessimistic confidence bounds into CBF constraints to ensure safe control. Moreover, we provide theoretical criteria to guarantee probabilistic safety during the control processes. To validate our approach, we conduct comparative experiments in various obstacle avoidance scenarios. The results demonstrate that our algorithm significantly improves the Bayesian model-based CBF method, and is capable for efficient safe exploration even with multiple uncertain constraints.

LGFeb 15, 2023
Optimal Sample Complexity of Reinforcement Learning for Mixing Discounted Markov Decision Processes

Shengbo Wang, Jose Blanchet, Peter Glynn · stanford

We consider the optimal sample complexity theory of tabular reinforcement learning (RL) for maximizing the infinite horizon discounted reward in a Markov decision process (MDP). Optimal worst-case complexity results have been developed for tabular RL problems in this setting, leading to a sample complexity dependence on $γ$ and $ε$ of the form $\tilde Θ((1-γ)^{-3}ε^{-2})$, where $γ$ denotes the discount factor and $ε$ is the solution error tolerance. However, in many applications of interest, the optimal policy (or all policies) induces mixing. We establish that in such settings, the optimal sample complexity dependence is $\tilde Θ(t_{\text{mix}}(1-γ)^{-2}ε^{-2})$, where $t_{\text{mix}}$ is the total variation mixing time. Our analysis is grounded in regeneration-type ideas, which we believe are of independent interest, as they can be used to study RL problems for general state space MDPs.

OCMar 1
Non-Rectangular Average-Reward Robust MDPs: Optimal Policies and Their Transient Values

Shengbo Wang, Nian Si · stanford

We study non-rectangular robust Markov decision processes under the average-reward criterion, where the ambiguity set couples transition probabilities across states and the adversary commits to a stationary kernel for the entire horizon. We show that any history-dependent policy achieving sublinear expected regret uniformly over the ambiguity set is robust-optimal, and that the robust value admits a minimax representation as the infimum over the ambiguity set of the classical optimal gains, without requiring any form of rectangularity or robust dynamic programming principle. Under the weak communication assumption, we establish the existence of such policies by converting high-probability regret bounds from the average-reward reinforcement learning literature into the expected-regret criterion. We then introduce a transient-value framework to evaluate finite-time performance of robust optimal policies, proving that average-reward optimality alone can mask arbitrarily poor transients and deriving regret-based lower bounds on transient values. Finally, we construct an epoch-based policy that combines an optimal stationary policy for the worst-case model with an anytime-valid sequential test and an online learning fallback, achieving a constant-order transient value.

CVSep 10, 2024
Neuromorphic spatiotemporal optical flow: Enabling ultrafast visual perception beyond human capabilities

Shengbo Wang, Jingwen Zhao, Tongming Pu et al.

Optical flow, inspired by the mechanisms of biological visual systems, calculates spatial motion vectors within visual scenes that are necessary for enabling robotics to excel in complex and dynamic working environments. However, current optical flow algorithms, despite human-competitive task performance on benchmark datasets, remain constrained by unacceptable time delays (~0.6 seconds per inference, 4X human processing speed) in practical deployment. Here, we introduce a neuromorphic optical flow approach that addresses delay bottlenecks by encoding temporal information directly in a synaptic transistor array to assist spatial motion analysis. Compared to conventional spatial-only optical flow methods, our spatiotemporal neuromorphic optical flow offers the spatial-temporal consistency of motion information, rapidly identifying regions of interest in as little as 1-2 ms using the temporal motion cues derived from the embedded temporal information in the two-dimensional floating gate synaptic transistors. Thus, the visual input can be selectively filtered to achieve faster velocity calculations and various task execution. At the hardware level, due to the atomically sharp interfaces between distinct functional layers in two-dimensional van der Waals heterostructures, the synaptic transistor offers high-frequency response (~100 μs), robust non-volatility (>10000 s), and excellent endurance (>8000 cycles), enabling robust visual processing. In software benchmarks, our system outperforms state-of-the-art algorithms with a 400% speedup, frequently surpassing human-level performance while maintaining or enhancing accuracy by utilizing the temporal priors provided by the embedded temporal information.

78.0SEMar 21
His2Trans: A Skeleton-First Framework for Self-Evolving C-to-Rust Translation with Historical Retrieval

Shengbo Wang, Mingwei Liu, Guangsheng Ou et al.

Automated C-to-Rust migration encounters systemic obstacles when scaling from code snippets to industrial projects, mainly because build context is often unavailable ("dependency hell") and domain-specific evolutionary knowledge is missing. As a result, current LLM-based methods frequently cannot reconstruct precise type definitions under complex build systems or infer idiomatic API correspondences, which in turn leads to hallucinated dependencies and unproductive repair loops.To tackle these issues, we introduce His2Trans, a framework that combines a deterministic, build-aware skeleton with self-evolving knowledge extraction to support stable, incremental migration. On the structural side, His2Trans performs build tracing to create a compilable Project-Level Skeleton Graph, providing a strictly typed environment that separates global verification from local logic generation. On the cognitive side, it derives fine-grained API and code-fragment rules from historical migration traces and uses a Retrieval-Augmented Generation (RAG) system to steer the LLM toward idiomatic interface reuse.Experiments on industrial OpenHarmony modules show that His2Trans reaches a 97.51% incremental compilation pass rate, effectively fixing build failures where baselines struggle. On general-purpose benchmarks, it reduces the unsafe code ratio by 25.23 percentage points compared with C2Rust while also lowering warning counts, although cross-domain functional correctness remains challenging. Finally, knowledge accumulation studies demonstrate the framework's evolutionary behavior: by continuously integrating verified patterns, His2Trans cuts repair overhead on unseen tasks by about 60%.

LGJan 29
Achieving $\varepsilon^{-2}$ Dependence for Average-Reward Q-Learning with a New Contraction Principle

Zijun Chen, Zaiwei Chen, Nian Si et al.

We present the convergence rates of synchronous and asynchronous Q-learning for average-reward Markov decision processes, where the absence of contraction poses a fundamental challenge. Existing non-asymptotic results overcome this challenge by either imposing strong assumptions to enforce seminorm contraction or relying on discounted or episodic Markov decision processes as successive approximations, which either require unknown parameters or result in suboptimal sample complexity. In this work, under a reachability assumption, we establish optimal $\widetilde{O}(\varepsilon^{-2})$ sample complexity guarantees (up to logarithmic factors) for a simple variant of synchronous and asynchronous Q-learning that samples from the lazified dynamics, where the system remains in the current state with some fixed probability. At the core of our analysis is the construction of an instance-dependent seminorm and showing that, after a lazy transformation of the Markov decision process, the Bellman operator becomes one-step contractive under this seminorm.

40.6LGMay 8
Central Limit Theorem for Two-Time-Scale Approximate Distributionally Robust RL

Shengbo Wang, Zexi Zhang

Designing model-free algorithms for distributionally robust reinforcement learning (DRRL) poses fundamental challenges. The robust Bellman operator is nonlinear in the transition kernel, which makes one-sample Bellman updates biased, while the adversarial optimization underlying robustness makes robust evaluation computationally demanding. To address these difficulties, we consider the natural small-ambiguity regime under Kullback--Leibler ambiguity sets and propose an approximate DRRL framework based on a first-order expansion of the relevant robust functional. This yields an approximate robust Bellman equation that removes the adversarial optimization while remaining first-order accurate in the ambiguity radius. To learn the fixed point of this approximate equation, we propose Mean-Variance Stochastic Approximation (MVSA), a model-free algorithm that uses only one-sample updates. This is achieved via a lifted stochastic approximation dynamics and a two-time-scale design. We then prove convergence and a central limit theorem for MVSA: its main iterate satisfies a central limit theorem at the canonical $n^{-1/2}$ scale, with explicitly characterized asymptotic covariances. Finally, we validate our theoretical findings with a numerical experiment.

62.3NCMay 5
Bio-plausible Neuromorphic Disturbance Observer Based on Emulation Theory: Extended Version

Hongfu Xu, Xiaoyu Guo, Shengbo Wang et al.

Biological neural systems achieve remarkable robustness and adaptability in uncertain environments through sparse, event-driven spike-based information processing and adaptive regulation. Inspired by this paradigm, this paper develops a neuromorhpic disturbance observer (NDO) and control framework that replaces conventional continuous-time signal representations with spike-timing encoding. Both disturbance estimates and control inputs are constructed via integrate-and-fire (IF) neuron dynamics from discrete spike events, yielding intrinsically event-driven updates. An adaptive-threshold triggering mechanism is inspired by spike-frequency adaptation (SFA), enabling history-dependent regulation of spike generation. Simulation results demonstrate that the proposed framework achieves neurally inspired robustness and adaptability, while the adaptive-threshold spiking scheme reduces spike events to 42.6% of the fixed-threshold case under noisy conditions.

AINov 23, 2023
Direct Preference-Based Evolutionary Multi-Objective Optimization with Dueling Bandit

Tian Huang, Shengbo Wang, Ke Li

Optimization problems find widespread use in both single-objective and multi-objective scenarios. In practical applications, users aspire for solutions that converge to the region of interest (ROI) along the Pareto front (PF). While the conventional approach involves approximating a fitness function or an objective function to reflect user preferences, this paper explores an alternative avenue. Specifically, we aim to discover a method that sidesteps the need for calculating the fitness function, relying solely on human feedback. Our proposed approach entails conducting direct preference learning facilitated by an active dueling bandit algorithm. The experimental phase is structured into three sessions. Firstly, we assess the performance of our active dueling bandit algorithm. Secondly, we implement our proposed method within the context of Multi-objective Evolutionary Algorithms (MOEAs). Finally, we deploy our method in a practical problem, specifically in protein structure prediction (PSP). This research presents a novel interactive preference-based MOEA framework that not only addresses the limitations of traditional techniques but also unveils new possibilities for optimization problems.

LGMar 3
Q-Measure-Learning for Continuous State RL: Efficient Implementation and Convergence

Shengbo Wang

We study reinforcement learning in infinite-horizon discounted Markov decision processes with continuous state spaces, where data are generated online from a single trajectory under a Markovian behavior policy. To avoid maintaining an infinite-dimensional, function-valued estimate, we propose the novel Q-Measure-Learning, which learns a signed empirical measure supported on visited state-action pairs and reconstructs an action-value estimate via kernel integration. The method jointly estimates the stationary distribution of the behavior chain and the Q-measure through coupled stochastic approximation, leading to an efficient weight-based implementation with $O(n)$ memory and $O(n)$ computation cost per iteration. Under uniform ergodicity of the behavior chain, we prove almost sure sup-norm convergence of the induced Q-function to the fixed point of a kernel-smoothed Bellman operator. We also bound the approximation error between this limit and the optimal $Q^*$ as a function of the kernel bandwidth. To assess the performance of our proposed algorithm, we conduct RL experiments in a two-item inventory control setting.

LGDec 6, 2023
Constrained Bayesian Optimization Under Partial Observations: Balanced Improvements and Provable Convergence

Shengbo Wang, Ke Li

The partially observable constrained optimization problems (POCOPs) impede data-driven optimization techniques since an infeasible solution of POCOPs can provide little information about the objective as well as the constraints. We endeavor to design an efficient and provable method for expensive POCOPs under the framework of constrained Bayesian optimization. Our method consists of two key components. Firstly, we present an improved design of the acquisition functions that introduces balanced exploration during optimization. We rigorously study the convergence properties of this design to demonstrate its effectiveness. Secondly, we propose a Gaussian process embedding different likelihoods as the surrogate model for a partially observable constraint. This model leads to a more accurate representation of the feasible regions compared to traditional classification-based models. Our proposed method is empirically studied on both synthetic and real-world problems. The results demonstrate the competitiveness of our method for solving POCOPs.

LGNov 12, 2025
Preference is More Than Comparisons: Rethinking Dueling Bandits with Augmented Human Feedback

Shengbo Wang, Hong Sun, Ke Li

Interactive preference elicitation (IPE) aims to substantially reduce human effort while acquiring human preferences in wide personalization systems. Dueling bandit (DB) algorithms enable optimal decision-making in IPE building on pairwise comparisons. However, they remain inefficient when human feedback is sparse. Existing methods address sparsity by heavily relying on parametric reward models, whose rigid assumptions are vulnerable to misspecification. In contrast, we explore an alternative perspective based on feedback augmentation, and introduce critical improvements to the model-free DB framework. Specifically, we introduce augmented confidence bounds to integrate augmented human feedback under generalized concentration properties, and analyze the multi-factored performance trade-off via regret analysis. Our prototype algorithm achieves competitive performance across several IPE benchmarks, including recommendation, multi-objective optimization, and response optimization for large language models, demonstrating the potential of our approach for provably efficient IPE in broader applications.

LGMay 15, 2025
Sample Complexity of Distributionally Robust Average-Reward Reinforcement Learning

Zijun Chen, Shengbo Wang, Nian Si · stanford

Motivated by practical applications where stable long-term performance is critical-such as robotics, operations research, and healthcare-we study the problem of distributionally robust (DR) average-reward reinforcement learning. We propose two algorithms that achieve near-optimal sample complexity. The first reduces the problem to a DR discounted Markov decision process (MDP), while the second, Anchored DR Average-Reward MDP, introduces an anchoring state to stabilize the controlled transition kernels within the uncertainty set. Assuming the nominal MDP is uniformly ergodic, we prove that both algorithms attain a sample complexity of $\widetilde{O}\left(|\mathbf{S}||\mathbf{A}| t_{\mathrm{mix}}^2\varepsilon^{-2}\right)$ for estimating the optimal policy as well as the robust average reward under KL and $f_k$-divergence-based uncertainty sets, provided the uncertainty radius is sufficiently small. Here, $\varepsilon$ is the target accuracy, $|\mathbf{S}|$ and $|\mathbf{A}|$ denote the sizes of the state and action spaces, and $t_{\mathrm{mix}}$ is the mixing time of the nominal MDP. This represents the first finite-sample convergence guarantee for DR average-reward reinforcement learning. We further validate the convergence rates of our algorithms through numerical experiments.

ASNov 27, 2024
Wearable intelligent throat enables natural speech in stroke patients with dysarthria

Chenyu Tang, Shuo Gao, Cong Li et al.

Wearable silent speech systems hold significant potential for restoring communication in patients with speech impairments. However, seamless, coherent speech remains elusive, and clinical efficacy is still unproven. Here, we present an AI-driven intelligent throat (IT) system that integrates throat muscle vibrations and carotid pulse signal sensors with large language model (LLM) processing to enable fluent, emotionally expressive communication. The system utilizes ultrasensitive textile strain sensors to capture high-quality signals from the neck area and supports token-level processing for real-time, continuous speech decoding, enabling seamless, delay-free communication. In tests with five stroke patients with dysarthria, IT's LLM agents intelligently corrected token errors and enriched sentence-level emotional and logical coherence, achieving low error rates (4.2% word error rate, 2.9% sentence error rate) and a 55% increase in user satisfaction. This work establishes a portable, intuitive communication platform for patients with dysarthria with the potential to be applied broadly across different neurological conditions and in multi-language support systems.

ETFeb 25, 2024
Lightweight, error-tolerant edge detection using memristor-enabled stochastic logics

Lekai Song, Pengyu Liu, Jingfang Pei et al.

The demand for efficient edge vision has spurred the interest in developing stochastic computing approaches for performing image processing tasks. Memristors with inherent stochasticity readily introduce probability into the computations and thus enable stochastic image processing computations. Here, we present a stochastic computing approach for edge detection, a fundamental image processing technique, facilitated with memristor-enabled stochastic logics. Specifically, we integrate the memristors with logic circuits and harness the stochasticity from the memristors to realize compact stochastic logics for stochastic number encoding and processing. The stochastic numbers, exhibiting well-regulated probabilities and correlations, can be processed to perform logic operations with statistical probabilities. This can facilitate lightweight stochastic edge detection for edge visual scenarios characterized with high-level noise errors. As a practical demonstration, we implement a hardware stochastic Roberts cross operator using the stochastic logics, and prove its exceptional edge detection performance, remarkably, with 95% less computational cost while withstanding 50% bit-flip errors. The results underscore the great potential of our stochastic edge detection approach in developing lightweight, error-tolerant edge vision hardware and systems for autonomous driving, virtual/augmented reality, medical imaging diagnosis, industrial automation, and beyond.

OCSep 17, 2025
Bellman Optimality of Average-Reward Robust Markov Decision Processes with a Constant Gain

Shengbo Wang, Nian Si · stanford

Learning and optimal control under robust Markov decision processes (MDPs) have received increasing attention, yet most existing theory, algorithms, and applications focus on finite-horizon or discounted models. Long-run average-reward formulations, while natural in many operations research and management contexts, remain underexplored. This is primarily because the dynamic programming foundations are technically challenging and only partially understood, with several fundamental questions remaining open. This paper steps toward a general framework for average-reward robust MDPs by analyzing the constant-gain setting. We study the average-reward robust control problem with possible information asymmetries between the controller and an S-rectangular adversary. Our analysis centers on the constant-gain robust Bellman equation, examining both the existence of solutions and their relationship to the optimal average reward. Specifically, we identify when solutions to the robust Bellman equation characterize the optimal average reward and stationary policies, and we provide one-sided weak communication conditions ensuring solutions' existence. These findings expand the dynamic programming theory for average-reward robust MDPs and lay a foundation for robust dynamic decision making under long-run average criteria in operational environments.

AIAug 18, 2025
EvolMathEval: Towards Evolvable Benchmarks for Mathematical Reasoning via Evolutionary Testing

Shengbo Wang, Mingwei Liu, Zike Li et al.

The rapid advancement of Large Language Models (LLMs) poses a significant challenge to existing mathematical reasoning benchmarks. However, these benchmarks tend to become easier over time as LLMs can learn from the published benchmarks. This limitation hinder the precise evaluation of the true capabilities of SOTA models. To address this challenge, this paper introduces EvolMathEval, an automated mathematical benchmark generation and evolution framework based on evolutionary testing. Experimental results demonstrate that EvolMathEval can not only generate a large volume of high-difficulty problems through continuous self-iteration, but it can also significantly enhance the complexity of public datasets like GSM8K through evolution, reducing model accuracy by an average of 48\%. Deeper investigation reveals that when solving these evolved problems, LLMs tend to bypass complex multi-step logical reasoning by relying on simplistic and fuzzy conditions, consequently leading to incorrect solutions. We define this phenomenon as the ``Pseudo Aha Moment", which we find accounts for 77\% to 100\% of errors on targeted problems. Code and resources are available at: https://anonymous.4open.science/r/EvolMathEval

LGMay 18, 2025
Near-Optimal Sample Complexities of Divergence-based S-rectangular Distributionally Robust Reinforcement Learning

Zhenghao Li, Shengbo Wang, Nian Si · stanford

Distributionally robust reinforcement learning (DR-RL) has recently gained significant attention as a principled approach that addresses discrepancies between training and testing environments. To balance robustness, conservatism, and computational traceability, the literature has introduced DR-RL models with SA-rectangular and S-rectangular adversaries. While most existing statistical analyses focus on SA-rectangular models, owing to their algorithmic simplicity and the optimality of deterministic policies, S-rectangular models more accurately capture distributional discrepancies in many real-world applications and often yield more effective robust randomized policies. In this paper, we study the empirical value iteration algorithm for divergence-based S-rectangular DR-RL and establish near-optimal sample complexity bounds of $\widetilde{O}(|\mathcal{S}||\mathcal{A}|(1-γ)^{-4}\varepsilon^{-2})$, where $\varepsilon$ is the target accuracy, $|\mathcal{S}|$ and $|\mathcal{A}|$ denote the cardinalities of the state and action spaces, and $γ$ is the discount factor. To the best of our knowledge, these are the first sample complexity results for divergence-based S-rectangular models that achieve optimal dependence on $|\mathcal{S}|$, $|\mathcal{A}|$, and $\varepsilon$ simultaneously. We further validate this theoretical dependence through numerical experiments on a robust inventory control problem and a theoretical worst-case example, demonstrating the fast learning performance of our proposed algorithm.

LGMay 19, 2025
Hardware-Adaptive and Superlinear-Capacity Memristor-based Associative Memory

Chengping He, Mingrui Jiang, Keyi Shan et al.

Brain-inspired computing aims to mimic cognitive functions like associative memory, the ability to recall complete patterns from partial cues. Memristor technology offers promising hardware for such neuromorphic systems due to its potential for efficient in-memory analog computing. Hopfield Neural Networks (HNNs) are a classic model for associative memory, but implementations on conventional hardware suffer from efficiency bottlenecks, while prior memristor-based HNNs faced challenges with vulnerability to hardware defects due to offline training, limited storage capacity, and difficulty processing analog patterns. Here we introduce and experimentally demonstrate on integrated memristor hardware a new hardware-adaptive learning algorithm for associative memories that significantly improves defect tolerance and capacity, and naturally extends to scalable multilayer architectures capable of handling both binary and continuous patterns. Our approach achieves 3x effective capacity under 50% device faults compared to state-of-the-art methods. Furthermore, its extension to multilayer architectures enables superlinear capacity scaling (\(\propto N^{1.49}\ for binary patterns) and effective recalling of continuous patterns (\propto N^{1.74}\ scaling), as compared to linear capacity scaling for previous HNNs. It also provides flexibility to adjust capacity by tuning hidden neurons for the same-sized patterns. By leveraging the massive parallelism of the hardware enabled by synchronous updates, it reduces energy by 8.8x and latency by 99.7% for 64-dimensional patterns over asynchronous schemes, with greater improvements at scale. This promises the development of more reliable memristor-based associative memory systems and enables new applications research due to the significantly improved capacity, efficiency, and flexibility.

SYMar 25, 2025
Optimal Parameter Adaptation for Safety-Critical Control via Safe Barrier Bayesian Optimization

Shengbo Wang, Ke Li, Zheng Yan et al.

Safety is of paramount importance in control systems to avoid costly risks and catastrophic damages. The control barrier function (CBF) method, a promising solution for safety-critical control, poses a new challenge of enhancing control performance due to its direct modification of original control design and the introduction of uncalibrated parameters. In this work, we shed light on the crucial role of configurable parameters in the CBF method for performance enhancement with a systematical categorization. Based on that, we propose a novel framework combining the CBF method with Bayesian optimization (BO) to optimize the safe control performance. Considering feasibility/safety-critical constraints, we develop a safe version of BO using the barrier-based interior method to efficiently search for promising feasible configurable parameters. Furthermore, we provide theoretical criteria of our framework regarding safety and optimality. An essential advantage of our framework lies in that it can work in model-agnostic environments, leaving sufficient flexibility in designing objective and constraint functions. Finally, simulation experiments on swing-up control and high-fidelity adaptive cruise control are conducted to demonstrate the effectiveness of our framework.

HCNov 28, 2024
An AI-Driven Multimodal Smart Home Platform for Continuous Monitoring and Assistance in Post-Stroke Motor Impairment

Chenyu Tang, Ruizhi Zhang, Shuo Gao et al.

At-home rehabilitation for post-stroke patients presents significant challenges, as continuous, personalized care is often limited outside clinical settings. Moreover, the lack of integrated solutions capable of simultaneously monitoring motor recovery and providing intelligent assistance in home environments hampers rehabilitation outcomes. Here, we present a multimodal smart home platform designed for continuous, at-home rehabilitation of post-stroke patients, integrating wearable sensing, ambient monitoring, and adaptive automation. A plantar pressure insole equipped with a machine learning pipeline classifies users into motor recovery stages with up to 94\% accuracy, enabling quantitative tracking of walking patterns during daily activities. An optional head-mounted eye-tracking module, together with ambient sensors such as cameras and microphones, supports seamless hands-free control of household devices with a 100\% success rate and sub-second response time. These data streams are fused locally via a hierarchical Internet of Things (IoT) architecture, ensuring low latency and data privacy. An embedded large language model (LLM) agent, Auto-Care, continuously interprets multimodal data to provide real-time interventions -- issuing personalized reminders, adjusting environmental conditions, and notifying caregivers. Implemented in a post-stroke context, this integrated smart home platform increased mean user satisfaction from 3.9 $\pm$ 0.8 in conventional home environments to 8.4 $\pm$ 0.6 with the full system ($n=20$). Beyond stroke, the system offers a scalable, patient-centered framework with potential for long-term use in broader neurorehabilitation and aging-in-place applications.

MLJun 17, 2024
Learning Optimal Distributionally Robust Stochastic Control in Continuous State Spaces

Shengbo Wang, Jason Meng, Nian Si et al.

We study data-driven learning of robust stochastic control for infinite-horizon systems with potentially continuous state and action spaces. In many managerial settings--supply chains, finance, manufacturing, services, and dynamic games--the state-transition mechanism is determined by system design, while available data capture the distributional properties of the stochastic inputs from the environment. For modeling and computational tractability, a decision maker often adopts a Markov control model with i.i.d. environment inputs, which can render learned policies fragile to internal dependence or external perturbations. We introduce a distributionally robust stochastic control paradigm that promotes policy reliability by introducing adaptive adversarial perturbations to the environment input, while preserving the modeling, statistical, and computational tractability of the Markovian formulation. From a modeling perspective, we examine two adversary models--current-action-aware and current-action-unaware--leading to distinct dynamic behaviors and robust optimal policies. From a statistical learning perspective, we characterize optimal finite-sample minimax rates for uniform learning of the robust value function across a continuum of states under ambiguity sets defined by the $f_k$-divergence and Wasserstein distance. To efficiently compute the optimal robust policies, we further propose algorithms inspired by deep reinforcement learning methodologies. Finally, we demonstrate the applicability of the framework to real managerial problems.

LGMay 28, 2023
Sample Complexity of Variance-reduced Distributionally Robust Q-learning

Shengbo Wang, Nian Si, Jose Blanchet et al.

Dynamic decision-making under distributional shifts is of fundamental interest in theory and applications of reinforcement learning: The distribution of the environment in which the data is collected can differ from that of the environment in which the model is deployed. This paper presents two novel model-free algorithms, namely the distributionally robust Q-learning and its variance-reduced counterpart, that can effectively learn a robust policy despite distributional shifts. These algorithms are designed to efficiently approximate the $q$-function of an infinite-horizon $γ$-discounted robust Markov decision process with Kullback-Leibler ambiguity set to an entry-wise $ε$-degree of precision. Further, the variance-reduced distributionally robust Q-learning combines the synchronous Q-learning with variance-reduction techniques to enhance its performance. Consequently, we establish that it attains a minimax sample complexity upper bound of $\tilde O(|\mathbf{S}||\mathbf{A}|(1-γ)^{-4}ε^{-2})$, where $\mathbf{S}$ and $\mathbf{A}$ denote the state and action spaces. This is the first complexity result that is independent of the ambiguity size $δ$, thereby providing new complexity theoretic insights. Additionally, a series of numerical experiments confirm the theoretical findings and the efficiency of the algorithms in handling distributional shifts.

LGNov 15, 2021
AutoGMap: Learning to Map Large-scale Sparse Graphs on Memristive Crossbars

Bo Lyu, Shengbo Wang, Shiping Wen et al.

The sparse representation of graphs has shown great potential for accelerating the computation of graph applications (e.g., Social Networks, Knowledge Graphs) on traditional computing architectures (CPU, GPU, or TPU). But the exploration of large-scale sparse graph computing on processing-in-memory (PIM) platforms (typically with memristive crossbars) is still in its infancy. To implement the computation or storage of large-scale or batch graphs on memristive crossbars, a natural assumption is that a large-scale crossbar is demanded, but with low utilization. Some recent works question this assumption, to avoid the waste of storage and computational resource, the fixed-size or progressively scheduled ''block partition'' schemes are proposed. However, these methods are coarse-grained or static, and are not effectively sparsity-aware. This work proposes the dynamic sparsity-aware mapping scheme generating method that models the problem with a sequential decision-making model, and optimizes it by reinforcement learning (RL) algorithm (REINFORCE). Our generating model (LSTM, combined with the dynamic-fill scheme) generates remarkable mapping performance on a small-scale graph/matrix data (complete mapping costs 43% area of the original matrix) and two large-scale matrix data (costing 22.5% area on qh882 and 17.1% area on qh1484). Our method may be extended to sparse graph computing on other PIM architectures, not limited to the memristive device-based platforms.