Vaibhav Srivastava

SY
h-index: 5
24 papers
569 citations
Novelty: 50%
AI Score: 55

24 Papers

SY · Apr 20, 2024
Human Motor Learning Dynamics in High-dimensional Tasks

Ankur Kamboj, Rajiv Ranganathan, Xiaobo Tan et al.

Conventional approaches to enhancing movement coordination, such as providing instructions and visual feedback, are often inadequate in complex motor tasks with multiple degrees of freedom (DoFs). To effectively address coordination deficits in such complex motor systems, it becomes imperative to develop interventions grounded in a model of human motor learning; however, modeling such learning processes is challenging due to the large DoFs. In this paper, we present a computational motor learning model that leverages the concept of motor synergies to extract low-dimensional learning representations in the high-dimensional motor space and the internal model theory of motor control to capture both fast and slow motor learning processes. We establish the model's convergence properties and validate it using data from a target capture game played by human participants. We study the influence of model parameters on several motor learning trade-offs such as speed-accuracy, exploration-exploitation, satisficing, and flexibility-performance, and show that the human motor learning system tunes these parameters to optimize learning and various output performance metrics.

LG · Sep 12, 2022
Deterministic Sequencing of Exploration and Exploitation for Reinforcement Learning

Piyush Gupta, Vaibhav Srivastava

We propose the Deterministic Sequencing of Exploration and Exploitation (DSEE) algorithm, which interleaves exploration and exploitation epochs, for model-based RL problems that aim to simultaneously learn the system model, i.e., a Markov decision process (MDP), and the associated optimal policy. During exploration, DSEE explores the environment and updates the estimates for expected reward and transition probabilities. During exploitation, the latest estimates of the expected reward and transition probabilities are used to obtain a robust policy with high probability. We design the lengths of the exploration and exploitation epochs such that the cumulative regret grows as a sub-linear function of time.

SY · May 14
Automated Curriculum Design for High-dimensional Human Motor Learning

Ankur Kamboj, Rajiv Ranganathan, Xiaobo Tan et al.

Designing effective practice schedules for high-dimensional motor learning tasks remains a challenge, especially when skill states are unobservable and task performance may not reflect the true learning. We propose an automated curriculum design framework that combines a human motor learning model and personalized real-time skill estimation with Stochastic Nonlinear Model Predictive Control in de novo (novel) motor learning paradigms. We validated our framework both through simulations and human-subject studies (N = 36) using a hand exoskeleton. Our proposed approach accelerates skill acquisition by $\sim 23\%$ and $\sim 17\%$ when compared to a random curriculum and a performance heuristics-based curriculum, respectively. These significant gains in learning efficiency highlight the potential of model-based, individualized curricula for motor rehabilitation and complex skill training.

SY · Mar 11
Multi-Robot Multitask Gaussian Process Estimation and Coverage

Lai Wei, Andrew McDonald, Vaibhav Srivastava

Coverage control is essential for the optimal deployment of agents to monitor or cover areas with sensory demands. While traditional coverage involves single-task robots, increasing autonomy now enables multitask operations. This paper introduces a novel multitask coverage problem and addresses it for both the cases of known and unknown sensory demands. For known demands, we design a federated multitask coverage algorithm and establish its convergence properties. For unknown demands, we employ a multitask Gaussian Process (GP) framework to learn sensory demand functions and integrate it with the multitask coverage algorithm to develop an adaptive algorithm. We introduce a novel notion of multitask coverage regret that compares the performance of the adaptive algorithm against an oracle with prior knowledge of the demand functions. We establish that our algorithm achieves sublinear cumulative regret, and numerically illustrate its performance.

SY · Apr 28
Co-Learning Port-Hamiltonian Systems and Optimal Energy-Shaping Control

Ankur Kamboj, Biswadip Dey, Vaibhav Srivastava

We develop a physics-informed learning framework for energy-shaping control of port-Hamiltonian (pH) systems from trajectory data. The proposed approach co-learns a pH system model and an optimal energy-balancing passivity-based controller (EB-PBC) through alternating optimization with policy-aware data collection. At each iteration, the system model is refined using trajectory data collected under the current control policy, and the controller is re-optimized on the updated model. Both components are parameterized by neural networks that embed the pH dynamics and EB-PBC structure, ensuring interpretability in terms of energy interactions. The learned controller renders the closed-loop system inherently passive and provably stable, and exploits passive plant dynamics without canceling the natural potential. A dissipation regularization enforces strict energy decay during training, thereby enhancing robustness to sim-to-real gaps. The proposed framework is validated on state-regulation and swing-up tasks for planar and torsional pendulum systems.

RO · Mar 13
Skill-informed Data-driven Haptic Nudges for High-dimensional Human Motor Learning

Ankur Kamboj, Rajiv Ranganathan, Xiaobo Tan et al.

In this work, we propose a data-driven skill-informed framework to design optimal haptic nudge feedback for high-dimensional novel motor learning tasks. We first model the stochastic dynamics of human motor learning using an Input-Output Hidden Markov Model (IOHMM), which explicitly decouples latent skill evolution from observable kinematic emissions. Leveraging this predictive model, we formulate the haptic nudge feedback design problem as a Partially Observable Markov Decision Process (POMDP). This allows us to derive an optimal nudging policy that minimizes long-term performance cost, implicitly guiding the learner toward robust regions of the skill space. We validated our approach through a human-subject study ($N=30$) using a high-dimensional hand-exoskeleton task. Results demonstrate that participants trained with the POMDP-derived policy exhibited significantly accelerated task performance compared to groups receiving heuristic-based feedback or no feedback. Furthermore, synergy analysis revealed that the POMDP group discovered efficient low-dimensional motor representations more rapidly.

AI · Jul 4, 2025
LogicGuard: Improving Embodied LLM agents through Temporal Logic based Critics

Anand Gokhale, Vaibhav Srivastava, Francesco Bullo

Large language models (LLMs) have shown promise in zero-shot and single-step reasoning and decision-making problems, but in long-horizon sequential planning tasks, their errors compound, often leading to unreliable or inefficient behavior. We introduce LogicGuard, a modular actor-critic architecture in which an LLM actor is guided by a trajectory-level LLM critic that communicates through Linear Temporal Logic (LTL). Our setup combines the reasoning strengths of language models with the guarantees of formal logic. The actor selects high-level actions from natural language observations, while the critic analyzes full trajectories and proposes new LTL constraints that shield the actor from future unsafe or inefficient behavior. LogicGuard supports both fixed safety rules and adaptive, learned constraints, and is model-agnostic: any LLM-based planner can serve as the actor, with LogicGuard acting as a logic-generating wrapper. We formalize planning as graph traversal under symbolic constraints, allowing LogicGuard to analyze failed or suboptimal trajectories and generate new temporal logic rules that improve future behavior. To demonstrate generality, we evaluate LogicGuard across two distinct settings: short-horizon general tasks and long-horizon specialist tasks. On the Behavior benchmark of 100 household tasks, LogicGuard increases task completion rates by 25% over a baseline InnerMonologue planner. On the Minecraft diamond-mining task, which is long-horizon and requires multiple interdependent subgoals, LogicGuard improves both efficiency and safety compared to SayCan and InnerMonologue. These results show that enabling LLMs to supervise each other through temporal logic yields more reliable, efficient and safe decision-making for embodied agents.

SY · Feb 6, 2022
Towards Modeling Human Motor Learning Dynamics in High-Dimensional Spaces

Ankur Kamboj, Rajiv Ranganathan, Xiaobo Tan et al.

Designing effective rehabilitation strategies for upper extremities, particularly hands and fingers, requires a computational model of human motor learning. The large number of degrees of freedom (DoFs) available in these systems makes it difficult to balance the trade-off between learning the full dexterity and accomplishing manipulation goals. The motor learning literature argues that humans use motor synergies to reduce the dimension of control space. Using the low-dimensional space spanned by these synergies, we develop a computational model based on the internal model theory of motor control. We analyze the proposed model in terms of its convergence properties and fit it to the data collected from human experiments. We compare the performance of the fitted model to the experimental data and show that it captures human motor learning behavior well.

HC · Jan 24, 2022
Structural Properties of Optimal Fidelity Selection Policies for Human-in-the-loop Queues

Piyush Gupta, Vaibhav Srivastava

We study optimal fidelity selection for a human operator servicing a queue of homogeneous tasks. The agent can service a task with a normal or high fidelity level, where fidelity refers to the degree of exactness and precision while servicing the task. Therefore, high-fidelity servicing results in higher-quality service but leads to larger service times and increased operator tiredness. We treat the human cognitive state as a lumped parameter that captures psychological factors such as workload and fatigue. The operator's service time distribution depends on her cognitive dynamics and the fidelity level selected for servicing the task. Her cognitive dynamics evolve as a Markov chain in which the cognitive state increases with high probability whenever she is busy and decreases while resting. The tasks arrive according to a Poisson process and the operator is penalized at a fixed rate for each task waiting in the queue. We address the trade-off between high-quality service of the task and consequent penalty due to a subsequent increase in queue length using a discrete-time Semi-Markov Decision Process framework. We numerically determine an optimal policy and the corresponding optimal value function. Finally, we establish the structural properties of an optimal fidelity policy and provide conditions under which the optimal policy is a threshold-based policy.

RO · Jun 28, 2021
Online Estimation and Coverage Control with Heterogeneous Sensing Information

Andrew McDonald, Lai Wei, Vaibhav Srivastava

Heterogeneous multi-robot sensing systems are able to characterize physical processes more comprehensively than homogeneous systems. Access to multiple modalities of sensory data allows such systems to fuse information between complementary sources and learn richer representations of a phenomenon of interest. Often, these data are correlated but vary in fidelity, i.e., accuracy (bias) and precision (noise). Low-fidelity data may be more plentiful, while high-fidelity data may be more trustworthy. In this paper, we address the problem of multi-robot online estimation and coverage control by combining low- and high-fidelity data to learn and cover a sensory function of interest. We propose two algorithms for this task of heterogeneous learning and coverage -- namely Stochastic Sequencing of Multi-fidelity Learning and Coverage (SMLC) and Deterministic Sequencing of Multi-fidelity Learning and Coverage (DMLC) -- and prove that they converge asymptotically. In addition, we demonstrate the empirical efficacy of SMLC and DMLC through numerical simulations.

LG · Jan 22, 2021
Nonstationary Stochastic Multiarmed Bandits: UCB Policies and Minimax Regret

Lai Wei, Vaibhav Srivastava

We study the nonstationary stochastic Multi-Armed Bandit (MAB) problem in which the reward distribution associated with each arm is assumed to be time-varying and the total variation in the expected rewards is subject to a variation budget. The regret of a policy is defined by the difference in the expected cumulative rewards obtained using the policy and using an oracle that selects the arm with the maximum mean reward at each time. We characterize the performance of the proposed policies in terms of the worst-case regret, which is the supremum of the regret over the set of reward distribution sequences satisfying the variation budget. We extend Upper-Confidence Bound (UCB)-based policies with three different approaches, namely, periodic resetting, sliding observation window and discount factor, and show that they are order-optimal with respect to the minimax regret, i.e., the minimum worst-case regret achieved by any policy. We also relax the sub-Gaussian assumption on reward distributions and develop robust versions of the proposed policies that can handle heavy-tailed reward distributions and maintain their performance guarantees.
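The quantities named in this abstract can be written out as a short sketch. The notation is an assumption made here for illustration and is not taken from the paper: $\mu_{k,t}$ is the mean reward of arm $k$ at time $t$, $\pi_t$ is the arm chosen by policy $\pi$, $V_T$ is the variation budget over horizon $T$, and $\mathcal{E}(V_T)$ is the set of mean-reward sequences that satisfy the budget. The particular form of the budget (maximum over arms, summed over time) is one common convention and may differ from the one the authors use.

```latex
% Variation budget (assumed form): total drift of the mean rewards is bounded.
\sum_{t=1}^{T-1} \max_{k} \left| \mu_{k,t+1} - \mu_{k,t} \right| \le V_T

% Regret of policy \pi against an oracle that picks the best arm at each time.
R_T^{\pi} = \sum_{t=1}^{T} \mathbb{E}\left[ \mu_t^{*} - \mu_{\pi_t,t} \right],
\qquad \mu_t^{*} = \max_{k} \mu_{k,t}

% Worst-case regret of \pi, and the minimax regret over all policies.
\sup_{\mu \in \mathcal{E}(V_T)} R_T^{\pi},
\qquad
\inf_{\pi} \, \sup_{\mu \in \mathcal{E}(V_T)} R_T^{\pi}
```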

RO · Jan 12, 2021
Multi-Robot Gaussian Process Estimation and Coverage: A Deterministic Sequencing Algorithm and Regret Analysis

Lai Wei, Andrew McDonald, Vaibhav Srivastava

We study the problem of distributed multi-robot coverage over an unknown, nonuniform sensory field. Modeling the sensory field as a realization of a Gaussian Process and using Bayesian techniques, we devise a policy which aims to balance the tradeoff between learning the sensory function and covering the environment. We propose an adaptive coverage algorithm called Deterministic Sequencing of Learning and Coverage (DSLC) that schedules learning and coverage epochs such that its emphasis gradually shifts from exploration to exploitation while never fully ceasing to learn. Using a novel definition of coverage regret which characterizes overall coverage performance of a multi-robot team over a time horizon $T$, we analyze DSLC to provide an upper bound on expected cumulative coverage regret. Finally, we illustrate the empirical performance of the algorithm through simulations of the coverage task over an unknown distribution of wildfires.

ML · Jul 20, 2020
Minimax Policy for Heavy-tailed Bandits

Lai Wei, Vaibhav Srivastava

We study the stochastic Multi-Armed Bandit (MAB) problem under worst-case regret and heavy-tailed reward distributions. We modify the minimax policy MOSS for the sub-Gaussian reward distribution by using a saturated empirical mean to design a new algorithm called Robust MOSS. We show that if the moment of order $1+\epsilon$ for the reward distribution exists, then the refined strategy has a worst-case regret matching the lower bound while maintaining a distribution-dependent logarithmic regret.

RO · May 18, 2020
Expedited Multi-Target Search with Guaranteed Performance via Multi-fidelity Gaussian Processes

Lai Wei, Xiaobo Tan, Vaibhav Srivastava

We consider a scenario in which an autonomous vehicle equipped with a downward-facing camera operates in a 3D environment and is tasked with searching for an unknown number of stationary targets on the 2D floor of the environment. The key challenge is to minimize the search time while ensuring a high detection accuracy. We model the sensing field using a multi-fidelity Gaussian process that systematically describes the sensing information available at different altitudes from the floor. Based on the sensing model, we design a novel algorithm called Expedited Multi-Target Search (EMTS) that (i) addresses the coverage-accuracy trade-off: sampling at locations farther from the floor provides a wider field of view but less accurate measurements, (ii) computes an occupancy map of the floor within a prescribed accuracy and quickly eliminates unoccupied regions from the search space, and (iii) travels efficiently to collect the required samples for target detection. We rigorously analyze the algorithm and establish formal guarantees on the target detection accuracy and the expected detection time. We illustrate the algorithm using a simulated multi-target search scenario.

OC · Mar 3, 2020
Distributed Cooperative Decision Making in Multi-agent Multi-armed Bandits

Peter Landgren, Vaibhav Srivastava, Naomi Ehrich Leonard

We study a distributed decision-making problem in which multiple agents face the same multi-armed bandit (MAB), and each agent makes sequential choices among arms to maximize its own individual reward. The agents cooperate by sharing their estimates over a fixed communication graph. We consider an unconstrained reward model in which two or more agents can choose the same arm and collect independent rewards. And we consider a constrained reward model in which agents that choose the same arm at the same time receive no reward. We design a dynamic, consensus-based, distributed estimation algorithm for cooperative estimation of mean rewards at each arm. We leverage the estimates from this algorithm to develop two distributed algorithms: coop-UCB2 and coop-UCB2-selective-learning, for the unconstrained and constrained reward models, respectively. We show that both algorithms achieve group performance close to the performance of a centralized fusion center. Further, we investigate the influence of the communication graph structure on performance. We propose a novel graph explore-exploit index that predicts the relative performance of groups in terms of the communication graph, and we propose a novel nodal explore-exploit centrality index that predicts the relative performance of agents in terms of the agent locations in the communication graph.

ML · Dec 12, 2018
On Distributed Multi-player Multiarmed Bandit Problems in Abruptly Changing Environment

Lai Wei, Vaibhav Srivastava

We study the multi-player stochastic multiarmed bandit (MAB) problem in an abruptly changing environment. We consider a collision model in which a player receives reward at an arm if it is the only player to select the arm. We design two novel algorithms, namely, Round-Robin Sliding-Window Upper Confidence Bound# (RR-SW-UCB#), and the Sliding-Window Distributed Learning with Prioritization (SW-DLP). We rigorously analyze these algorithms and show that the expected cumulative group regret for these algorithms is upper bounded by sublinear functions of time, i.e., the time average of the regret asymptotically converges to zero. We complement our analytic results with numerical illustrations.

ML · Feb 23, 2018
On Abruptly-Changing and Slowly-Varying Multiarmed Bandit Problems

Lai Wei, Vaibhav Srivastava

We study the non-stationary stochastic multiarmed bandit (MAB) problem and propose two generic algorithms, namely, the limited memory deterministic sequencing of exploration and exploitation (LM-DSEE) and the Sliding-Window Upper Confidence Bound# (SW-UCB#). We rigorously analyze these algorithms in abruptly-changing and slowly-varying environments and characterize their performance. We show that the expected cumulative regret for these algorithms under either of the environments is upper bounded by sublinear functions of time, i.e., the time average of the regret asymptotically converges to zero. We complement our analytic results with numerical illustrations.
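As a rough illustration of the sliding-window idea mentioned in this abstract, the following is a minimal sketch of a generic sliding-window UCB rule: only the most recent plays inside a fixed window contribute to the mean estimate and the exploration bonus, so stale rewards are forgotten. It is not the SW-UCB# algorithm from the paper, whose window schedule and constants are not given in the abstract. The class name `SlidingWindowUCB`, the fixed `window`, and the bonus constant `c` are all assumptions made for this example.

```python
import math
from collections import deque


class SlidingWindowUCB:
    """Generic sliding-window UCB sketch (hypothetical, not the paper's SW-UCB#)."""

    def __init__(self, n_arms, window, c=2.0):
        self.n_arms = n_arms
        self.c = c  # exploration constant (assumed value)
        # Only the last `window` (arm, reward) pairs are remembered.
        self.history = deque(maxlen=window)

    def select_arm(self):
        counts = [0] * self.n_arms
        sums = [0.0] * self.n_arms
        for arm, reward in self.history:
            counts[arm] += 1
            sums[arm] += reward

        # Any arm with no play inside the window is tried first.
        for arm in range(self.n_arms):
            if counts[arm] == 0:
                return arm

        horizon = len(self.history)

        def index(arm):
            mean = sums[arm] / counts[arm]
            bonus = math.sqrt(self.c * math.log(horizon) / counts[arm])
            return mean + bonus

        return max(range(self.n_arms), key=index)

    def update(self, arm, reward):
        # Appending beyond `maxlen` silently drops the oldest observation.
        self.history.append((arm, reward))
```

A typical loop calls `select_arm()`, observes a reward from the environment, and then calls `update(arm, reward)`. Recomputing the counts from the window on every call keeps the sketch short; an incremental implementation would be preferable for long windows.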

SY · Jun 2, 2016
Distributed Cooperative Decision-Making in Multiarmed Bandits: Frequentist and Bayesian Algorithms

Peter Landgren, Vaibhav Srivastava, Naomi Ehrich Leonard

We study distributed cooperative decision-making under the explore-exploit tradeoff in the multiarmed bandit (MAB) problem. We extend the state-of-the-art frequentist and Bayesian algorithms for single-agent MAB problems to cooperative distributed algorithms for multi-agent MAB problems in which agents communicate according to a fixed network graph. We rely on a running consensus algorithm for each agent's estimation of mean rewards from its own rewards and the estimated rewards of its neighbors. We prove the performance of these algorithms and show that they asymptotically recover the performance of a centralized agent. Further, we rigorously characterize the influence of the communication graph structure on the decision-making performance of the group.

LG · Dec 23, 2015
Satisficing in multi-armed bandit problems

Paul Reverdy, Vaibhav Srivastava, Naomi Ehrich Leonard

Satisficing is a relaxation of maximizing and allows for less risky decision making in the face of uncertainty. We propose two sets of satisficing objectives for the multi-armed bandit problem, where the objective is to achieve reward-based decision-making performance above a given threshold. We show that these new problems are equivalent to various standard multi-armed bandit problems with maximizing objectives and use the equivalence to find bounds on performance. The different objectives can result in qualitatively different behavior; for example, agents explore their options continually in one case and only a finite number of times in another. For the case of Gaussian rewards we show an additional equivalence between the two sets of satisficing objectives that allows algorithms developed for one set to be applied to the other. We then develop variants of the Upper Credible Limit (UCL) algorithm that solve the problems with satisficing objectives and show that these modified UCL algorithms achieve efficient satisficing performance.

SY · Dec 21, 2015
On Distributed Cooperative Decision-Making in Multiarmed Bandits

Peter Landgren, Vaibhav Srivastava, Naomi Ehrich Leonard

We study the explore-exploit tradeoff in distributed cooperative decision-making using the context of the multiarmed bandit (MAB) problem. For the distributed cooperative MAB problem, we design the cooperative UCB algorithm that comprises two interleaved distributed processes: (i) running consensus algorithms for estimation of rewards, and (ii) upper-confidence-bound-based heuristics for selection of arms. We rigorously analyze the performance of the cooperative UCB algorithm and characterize the influence of communication graph structure on the decision-making performance of the group.
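The consensus ingredient in this abstract can be illustrated with a minimal sketch of plain averaging consensus over a fixed undirected graph: each agent repeatedly replaces its value with a weighted average of its own and its neighbors' values. This is the basic building block only, not the running-consensus estimator or the cooperative UCB rule from the paper. The function names, the weight choice `1 / (max_degree + 1)`, and the example graph are assumptions made here.

```python
def consensus_matrix(adjacency):
    """Build averaging weights P = I - step * Laplacian for an undirected graph.

    `adjacency` is a symmetric 0/1 matrix given as a list of lists.
    The step size 1 / (max_degree + 1) is an assumed, conservative choice.
    """
    n = len(adjacency)
    degrees = [sum(row) for row in adjacency]
    step = 1.0 / (max(degrees) + 1)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                weights[i][j] = 1.0 - step * degrees[i]
            else:
                weights[i][j] = step * adjacency[i][j]
    return weights


def consensus_step(weights, values):
    """One round of communication: each agent averages over its neighborhood."""
    n = len(values)
    return [sum(weights[i][j] * values[j] for j in range(n)) for i in range(n)]


# Example: three agents on a path graph 0 - 1 - 2.
path_graph = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
P = consensus_matrix(path_graph)
estimates = [3.0, 0.0, 0.0]
for _ in range(200):
    estimates = consensus_step(P, estimates)
```

On a connected graph these weights drive every agent's value toward the average of the initial values, which is the sense in which a group of agents can approximate what a centralized agent would compute.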

OC · Jul 5, 2015
Correlated Multiarmed Bandit Problem: Bayesian Algorithms and Regret Analysis

Vaibhav Srivastava, Paul Reverdy, Naomi Ehrich Leonard

We consider the correlated multiarmed bandit (MAB) problem in which the rewards associated with each arm are modeled by a multivariate Gaussian random variable, and we investigate the influence of the assumptions in the Bayesian prior on the performance of the upper credible limit (UCL) algorithm and a new correlated UCL algorithm. We rigorously characterize the influence of accuracy, confidence, and correlation scale in the prior on the decision-making performance of the algorithms. Our results show how priors and correlation structure can be leveraged to improve performance.

OC · Nov 12, 2013
Mixed Human-Robot Team Surveillance

Vaibhav Srivastava, Amit Surana, Miguel P. Eckstein et al.

We study the mixed human-robot team design in a system theoretic setting using the context of a surveillance mission. The three key coupled components of a mixed team design are (i) policies for the human operator, (ii) policies to account for erroneous human decisions, and (iii) policies to control the automaton. In this paper, we survey elements of human decision-making, including evidence aggregation, situational awareness, fatigue, and memory effects. We bring together the models for these elements in human decision-making to develop a single coherent model for human decision-making in a two-alternative choice task. We utilize the developed model to design efficient attention allocation policies for the human operator. We propose an anomaly detection algorithm that utilizes potentially erroneous decisions by the operator to ascertain an anomalous region among the set of regions surveilled. Finally, we propose a stochastic vehicle routing policy that surveils an anomalous region with high probability. Our mixed team design relies on the certainty-equivalent receding-horizon control framework.

LG · Jul 23, 2013
Modeling Human Decision-making in Generalized Gaussian Multi-armed Bandits

Paul Reverdy, Vaibhav Srivastava, Naomi E. Leonard

We present a formal model of human decision-making in explore-exploit tasks using the context of multi-armed bandit problems, where the decision-maker must choose among multiple options with uncertain rewards. We address the standard multi-armed bandit problem, the multi-armed bandit problem with transition costs, and the multi-armed bandit problem on graphs. We focus on the case of Gaussian rewards in a setting where the decision-maker uses Bayesian inference to estimate the reward values. We model the decision-maker's prior knowledge with the Bayesian prior on the mean reward. We develop the upper credible limit (UCL) algorithm for the standard multi-armed bandit problem and show that this deterministic algorithm achieves logarithmic cumulative expected regret, which is optimal performance for uninformative priors. We show how good priors and good assumptions on the correlation structure among arms can greatly enhance decision-making performance, even over short time horizons. We extend to the stochastic UCL algorithm and draw several connections to human decision-making behavior. We present empirical data from human experiments and show that human performance is efficiently captured by the stochastic UCL algorithm with appropriate parameters. For the multi-armed bandit problem with transition costs and the multi-armed bandit problem on graphs, we generalize the UCL algorithm to the block UCL algorithm and the graphical block UCL algorithm, respectively. We show that these algorithms also achieve logarithmic cumulative expected regret and require a sub-logarithmic expected number of transitions among arms. We further illustrate the performance of these algorithms with numerical examples. NB: Appendix G included in this version details minor modifications that correct for an oversight in the previously-published proofs. The remainder of the text reflects the published work.
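To make the upper credible limit idea concrete, the following is a minimal sketch for a single arm with Gaussian rewards of known noise variance and a Gaussian prior on the mean reward. The index is the posterior mean plus the posterior standard deviation times a standard-normal quantile that grows with time. The abstract does not give the credible-level schedule; the form `1 - 1/(K t)` with `K = sqrt(2 * pi * e)` is used here as an assumption, and the function names are hypothetical. An agent would pick the arm with the largest index at each step.

```python
import math
from statistics import NormalDist


def gaussian_posterior(prior_mean, prior_std, noise_std, rewards):
    """Conjugate update of a Gaussian prior on an arm's mean reward.

    Assumes rewards are Gaussian with known standard deviation `noise_std`.
    Returns the posterior mean and posterior standard deviation.
    """
    prior_precision = 1.0 / prior_std**2
    data_precision = len(rewards) / noise_std**2
    precision = prior_precision + data_precision
    mean = (prior_mean * prior_precision + sum(rewards) / noise_std**2) / precision
    return mean, math.sqrt(1.0 / precision)


def ucl_index(post_mean, post_std, t, K=math.sqrt(2.0 * math.pi * math.e)):
    """Upper credible limit: the (1 - 1/(K t)) quantile of the posterior.

    The schedule 1/(K t) and the constant K are assumptions for this sketch.
    """
    alpha = 1.0 / (K * t)
    return post_mean + post_std * NormalDist().inv_cdf(1.0 - alpha)
```

An informative prior (small `prior_std` around a good `prior_mean`) shrinks the bonus term from the start, which is one way to read the abstract's remark that good priors improve performance even over short horizons.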

RO · Oct 12, 2012
Stochastic Surveillance Strategies for Spatial Quickest Detection

Vaibhav Srivastava, Fabio Pasqualetti, Francesco Bullo

We design persistent surveillance strategies for the quickest detection of anomalies taking place in an environment of interest. From a set of predefined regions in the environment, a team of autonomous vehicles collects noisy observations, which a control center processes. The overall objective is to minimize detection delay while maintaining the false alarm rate below a desired threshold. We present joint (i) anomaly detection algorithms for the control center and (ii) vehicle routing policies. For the control center, we propose parallel cumulative sum (CUSUM) algorithms (one for each region) to detect anomalies from noisy observations. For the vehicles, we propose a stochastic routing policy, in which the regions to be visited are chosen according to a probability vector. We study a stationary routing policy (the probability vector is constant) as well as adaptive routing policies (the probability vector varies in time as a function of the likelihood of regional anomalies). In the context of stationary policies, we design a performance metric and minimize it to design an efficient stationary routing policy. Our adaptive policy improves upon the stationary counterpart by adaptively increasing the selection probability of regions with high likelihood of anomaly. Finally, we show the effectiveness of the proposed algorithms through numerical simulations and a persistent surveillance experiment.
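The CUSUM test named in this abstract can be sketched for a single region as follows. The sketch assumes Gaussian observations with known standard deviation whose mean shifts from `mean0` (nominal) to `mean1` (anomalous); these distributional assumptions, the function name, and the parameter names are illustrative and are not taken from the paper. A parallel version would simply keep one such statistic per region and update only the region that was just visited.

```python
def cusum_alarm_time(observations, mean0, mean1, sigma, threshold):
    """Return the 1-based index of the first CUSUM alarm, or None if no alarm.

    Assumes i.i.d. Gaussian observations with standard deviation `sigma`
    and a possible mean shift from `mean0` to `mean1`.
    """
    statistic = 0.0
    for t, y in enumerate(observations, start=1):
        # Log-likelihood ratio of N(mean1, sigma^2) against N(mean0, sigma^2).
        llr = (mean1 - mean0) / sigma**2 * (y - (mean0 + mean1) / 2.0)
        # Accumulate evidence, resetting to zero whenever it goes negative.
        statistic = max(0.0, statistic + llr)
        if statistic >= threshold:
            return t
    return None
```

Raising `threshold` lowers the false alarm rate at the cost of a longer detection delay, which is the trade-off the abstract's objective refers to.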