Michael M. Zavlanos

LG
h-index: 40
52 papers
1,153 citations
Novelty: 54%
AI Score: 56

52 Papers

LG Sep 6, 2022
A Zeroth-Order Momentum Method for Risk-Averse Online Convex Games

Zifan Wang, Yi Shen, Zachary I. Bell et al.

We consider risk-averse learning in repeated unknown games where the goal of the agents is to minimize their individual risk of incurring significantly high cost. Specifically, the agents use the conditional value at risk (CVaR) as a risk measure and rely on bandit feedback in the form of the cost values of the selected actions at every episode to estimate their CVaR values and update their actions. A major challenge in using bandit feedback to estimate CVaR is that the agents can only access their own cost values, which, however, depend on the actions of all agents. To address this challenge, we propose a new risk-averse learning algorithm with momentum that utilizes the full historical information on the cost values. We show that this algorithm achieves sub-linear regret and matches the best known algorithms in the literature. We provide numerical experiments for a Cournot game that show that our method outperforms existing methods.

OC Mar 23, 2023
Policy Evaluation in Distributional LQR

Zifan Wang, Yulong Gao, Siyi Wang et al.

Distributional reinforcement learning (DRL) enhances the understanding of the effects of the randomness in the environment by letting agents learn the distribution of a random return, rather than its expected value as in standard RL. At the same time, a main challenge in DRL is that policy evaluation in DRL typically relies on the representation of the return distribution, which needs to be carefully designed. In this paper, we address this challenge for a special class of DRL problems that rely on linear quadratic regulator (LQR) for control, advocating for a new distributional approach to LQR, which we call distributional LQR. Specifically, we provide a closed-form expression of the distribution of the random return which, remarkably, is applicable to all exogenous disturbances on the dynamics, as long as they are independent and identically distributed (i.i.d.). While the proposed exact return distribution consists of infinitely many random variables, we show that this distribution can be approximated by a finite number of random variables, and the associated approximation error can be analytically bounded under mild assumptions. Using the approximate return distribution, we propose a zeroth-order policy gradient algorithm for risk-averse LQR using the Conditional Value at Risk (CVaR) as a measure of risk. Numerical experiments are provided to illustrate our theoretical results.

MA Sep 30, 2010
Spectral Control of Mobile Robot Networks

Michael M. Zavlanos, Victor M. Preciado, Ali Jadbabaie

The eigenvalue spectrum of the adjacency matrix of a network is closely related to the behavior of many dynamical processes run over the network. In the field of robotics, this spectrum has important implications in many problems that require some form of distributed coordination within a team of robots. In this paper, we propose a continuous-time control scheme that modifies the structure of a position-dependent network of mobile robots so that it achieves a desired set of adjacency eigenvalues. For this, we employ a novel abstraction of the eigenvalue spectrum by means of the adjacency matrix spectral moments. Since the eigenvalue spectrum is uniquely determined by its spectral moments, this abstraction provides a way to indirectly control the eigenvalues of the network. Our construction is based on artificial potentials that capture the distance of the network's spectral moments to their desired values. Minimization of these potentials is via a gradient descent closed-loop system that, under certain convexity assumptions, ensures convergence of the network topology to one with the desired set of moments and, therefore, eigenvalues. We illustrate our approach in nontrivial computer simulations.
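
The spectral-moment abstraction described above can be illustrated with a small sketch. It assumes the k-th spectral moment of an n-node adjacency matrix A is defined as m_k = trace(A^k) / n, which equals the average of the k-th powers of the eigenvalues. The normalization by n, the function name `spectral_moments`, and the 4-node path graph are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def spectral_moments(A, K):
    """Return [m_1, ..., m_K] with m_k = trace(A^k) / n.

    The 1/n normalization is an assumption of this sketch.
    """
    n = A.shape[0]
    moments = []
    P = np.eye(n)
    for _ in range(K):
        P = P @ A                      # P now holds the next power of A
        moments.append(np.trace(P) / n)
    return np.array(moments)

# Hypothetical example: undirected path graph on 4 nodes.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

m = spectral_moments(A, 4)

# The same moments computed from the eigenvalues of the symmetric matrix A.
eigs = np.linalg.eigvalsh(A)
m_from_eigs = np.array([np.mean(eigs ** k) for k in range(1, 5)])
```

Both computations agree, which is why moments computed from matrix powers (and hence from the network structure) can stand in for the eigenvalues themselves.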

LG Mar 16, 2022
Risk-Averse No-Regret Learning in Online Convex Games

Zifan Wang, Yi Shen, Michael M. Zavlanos

We consider an online stochastic game with risk-averse agents whose goal is to learn optimal decisions that minimize the risk of incurring significantly high costs. Specifically, we use the Conditional Value at Risk (CVaR) as a risk measure that the agents can estimate using bandit feedback in the form of the cost values of only their selected actions. Since the distributions of the cost functions depend on the actions of all agents that are generally unobservable, they are themselves unknown and, therefore, the CVaR values of the costs are difficult to compute. To address this challenge, we propose a new online risk-averse learning algorithm that relies on one-point zeroth-order estimation of the CVaR gradients computed using CVaR values that are estimated by appropriately sampling the cost functions. We show that this algorithm achieves sub-linear regret with high probability. We also propose two variants of this algorithm that improve performance. The first variant relies on a new sampling strategy that uses samples from the previous iteration to improve the estimation accuracy of the CVaR values. The second variant employs residual feedback that uses CVaR values from the previous iteration to reduce the variance of the CVaR gradient estimates. We theoretically analyze the convergence properties of these variants and illustrate their performance on an online market problem that we model as a Cournot game.
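
A minimal single-agent sketch of the two ingredients named above, sample-based CVaR estimation and a one-point zeroth-order gradient estimate, might look as follows. It uses the standard one-point estimator (d / delta) * value * u with u drawn uniformly from the unit sphere, and it treats CVaR at level alpha as the mean of the worst (1 - alpha) fraction of sampled costs. The cost function `noisy_quadratic`, all parameter values, and the alpha convention are assumptions for illustration. The paper's sampling strategy and the dependence on other agents' actions are not modeled here.

```python
import numpy as np

def empirical_cvar(costs, alpha):
    """Mean of the largest ceil((1 - alpha) * n) sampled costs."""
    costs = np.sort(np.asarray(costs, dtype=float))
    # round() guards against floating-point noise such as 3.0000000000000004
    k = max(1, int(np.ceil(round((1.0 - alpha) * len(costs), 9))))
    return costs[-k:].mean()

def one_point_cvar_gradient(sample_cost, x, delta, alpha, n_samples, rng):
    """One-point zeroth-order estimate of the CVaR gradient at x."""
    d = x.size
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)             # uniform direction on the unit sphere
    # Query the cost only at the single perturbed action x + delta * u.
    costs = [sample_cost(x + delta * u, rng) for _ in range(n_samples)]
    return (d / delta) * empirical_cvar(costs, alpha) * u

def noisy_quadratic(x, rng):
    """Hypothetical stochastic cost: quadratic plus uniform noise."""
    return float(np.sum(x ** 2) + rng.uniform(0.0, 1.0))

rng = np.random.default_rng(0)
x = np.array([1.0, -0.5])
g = one_point_cvar_gradient(noisy_quadratic, x, delta=0.1,
                            alpha=0.9, n_samples=50, rng=rng)
```

A single estimate of this kind is noisy, which is consistent with the abstract's motivation for variants that reuse samples or CVaR values from the previous iteration.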

LG Sep 15, 2023
Wasserstein Distributionally Robust Policy Evaluation and Learning for Contextual Bandits

Yi Shen, Pan Xu, Michael M. Zavlanos

Off-policy evaluation and learning are concerned with assessing a given policy and learning an optimal policy from offline data without direct interaction with the environment. Often, the environment in which the data are collected differs from the environment in which the learned policy is applied. To account for the effect of different environments during learning and execution, distributionally robust optimization (DRO) methods have been developed that compute worst-case bounds on the policy values assuming that the distribution of the new environment lies within an uncertainty set. Typically, this uncertainty set is defined based on the KL divergence around the empirical distribution computed from the logging dataset. However, the KL uncertainty set fails to encompass distributions with varying support and lacks awareness of the geometry of the distribution support. As a result, KL approaches fall short in addressing practical environment mismatches and lead to over-fitting to worst-case scenarios. To overcome these limitations, we propose a novel DRO approach that employs the Wasserstein distance instead. While Wasserstein DRO is generally computationally more expensive compared to KL DRO, we present a regularized method and a practical (biased) stochastic gradient descent method to optimize the policy efficiently. We also provide a theoretical analysis of the finite sample complexity and iteration complexity for our proposed method. We further validate our approach using a public dataset that was recorded in a randomized stroke trial.
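
The support issue raised above can be seen in a toy example. The sketch below compares the KL divergence and the 1-D Wasserstein-1 distance between an empirical distribution and a copy whose mass is shifted by one grid step. The grid, the two distributions, and the helper names are hypothetical. The Wasserstein-1 distance is computed with the standard CDF formula for distributions on the real line.

```python
import numpy as np

def kl_divergence(q, p):
    """KL(q || p) on a shared grid; infinite if q has mass where p has none."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    support = q > 0
    if np.any(p[support] == 0):
        return np.inf
    return float(np.sum(q[support] * np.log(q[support] / p[support])))

def wasserstein1_1d(grid, q, p):
    """W1 between two distributions on a sorted 1-D grid (CDF formula)."""
    grid = np.asarray(grid, dtype=float)
    cdf_gap = np.abs(np.cumsum(q) - np.cumsum(p))[:-1]
    return float(np.sum(cdf_gap * np.diff(grid)))

grid = np.array([0.0, 1.0, 2.0, 3.0])
empirical = np.array([0.5, 0.5, 0.0, 0.0])   # hypothetical logged data
shifted = np.array([0.0, 0.5, 0.5, 0.0])     # same shape, moved by one step

kl = kl_divergence(shifted, empirical)        # infinite: new support point
w1 = wasserstein1_1d(grid, shifted, empirical)  # finite: mass moved by 1
```

No finite KL radius around the empirical distribution contains the shifted one, whereas a Wasserstein ball of radius 1 does.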

LG Sep 9, 2022
Risk-Averse Multi-Armed Bandits with Unobserved Confounders: A Case Study in Emotion Regulation in Mobile Health

Yi Shen, Jessilyn Dunn, Michael M. Zavlanos

In this paper, we consider a risk-averse multi-armed bandit (MAB) problem where the goal is to learn a policy that minimizes the risk of low expected return, as opposed to maximizing the expected return itself, which is the objective in the usual approach to risk-neutral MAB. Specifically, we formulate this problem as a transfer learning problem between an expert and a learner agent in the presence of contexts that are only observable by the expert but not by the learner. Thus, such contexts are unobserved confounders (UCs) from the learner's perspective. Given a dataset generated by the expert that excludes the UCs, the goal for the learner is to identify the true minimum-risk arm with fewer online learning steps, while avoiding possible biased decisions due to the presence of UCs in the expert's data.

LG Feb 18
Efficient Tail-Aware Generative Optimization via Flow Model Fine-Tuning

Zifan Wang, Riccardo De Santi, Xiaoyu Mo et al.

Fine-tuning pre-trained diffusion and flow models to optimize downstream utilities is central to real-world deployment. Existing entropy-regularized methods primarily maximize expected reward, providing no mechanism to shape tail behavior. However, tail control is often essential: the lower tail determines reliability by limiting low-reward failures, while the upper tail enables discovery by prioritizing rare, high-reward outcomes. In this work, we present Tail-aware Flow Fine-Tuning (TFFT), a principled and efficient distributional fine-tuning algorithm based on the Conditional Value-at-Risk (CVaR). We address two distinct tail-shaping goals: right-CVaR for seeking novel samples in the high-reward tail and left-CVaR for controlling worst-case samples in the low-reward tail. Unlike prior approaches that rely on non-linear optimization, we leverage the variational dual formulation of CVaR to decompose it into a decoupled two-stage procedure: a lightweight one-dimensional threshold optimization step, and a single entropy-regularized fine-tuning process via a specific pseudo-reward. This decomposition achieves CVaR fine-tuning efficiently with computational cost comparable to standard expected fine-tuning methods. We demonstrate the effectiveness of TFFT across illustrative experiments, high-dimensional text-to-image generation, and molecular design.
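
The one-dimensional threshold step mentioned above can be sketched with the standard Rockafellar-Uryasev variational form of CVaR, CVaR_alpha(X) = min_t { t + E[(X - t)_+] / (1 - alpha) }, applied here to the upper tail of a sample. This is only an illustration of the threshold optimization, not the TFFT algorithm: the pseudo-reward and the fine-tuning stage are not shown, and the Gaussian samples, the level alpha = 0.9, and the function names are assumptions.

```python
import numpy as np

def ru_objective(t, x, alpha):
    """Rockafellar-Uryasev objective t + E[(x - t)_+] / (1 - alpha)."""
    return t + np.mean(np.maximum(x - t, 0.0)) / (1.0 - alpha)

def cvar_via_threshold(x, alpha):
    """Minimize the objective over the threshold t.

    For an empirical sample the objective is convex and piecewise linear
    with breakpoints at the sample values, so searching over the samples
    themselves finds the exact minimum.
    """
    x = np.asarray(x, dtype=float)
    values = np.array([ru_objective(t, x, alpha) for t in x])
    best = int(np.argmin(values))
    return x[best], values[best]

rng = np.random.default_rng(0)
x = rng.normal(size=200)          # hypothetical reward samples
alpha = 0.9

t_star, cvar_dual = cvar_via_threshold(x, alpha)

# Direct estimate: mean of the top (1 - alpha) * n = 20 samples.
cvar_tail = np.sort(x)[-20:].mean()
```

The two estimates coincide up to floating-point error, which shows how a tail average can be recovered from a scalar optimization over the threshold.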

CL Dec 9, 2025
An Agentic AI System for Multi-Framework Communication Coding

Bohao Yang, Rui Yang, Joshua M. Biro et al.

Clinical communication is central to patient outcomes, yet large-scale human annotation of patient-provider conversation remains labor-intensive, inconsistent, and difficult to scale. Existing approaches based on large language models typically rely on single-task models that lack adaptability, interpretability, and reliability, especially when applied across various communication frameworks and clinical domains. In this study, we developed a Multi-framework Structured Agentic AI system for Clinical Communication (MOSAIC), built on a LangGraph-based architecture that orchestrates four core agents, including a Plan Agent for codebook selection and workflow planning, an Update Agent for maintaining up-to-date retrieval databases, a set of Annotation Agents that applies codebook-guided retrieval-augmented generation (RAG) with dynamic few-shot prompting, and a Verification Agent that provides consistency checks and feedback. To evaluate performance, we compared MOSAIC outputs against gold-standard annotations created by trained human coders. We developed and evaluated MOSAIC using 26 gold standard annotated transcripts for training and 50 transcripts for testing, spanning rheumatology and OB/GYN domains. On the test set, MOSAIC achieved an overall F1 score of 0.928. Performance was highest in the Rheumatology subset (F1 = 0.962) and strongest for Patient Behavior (e.g., patients asking questions, expressing preferences, or showing assertiveness). Ablations revealed that MOSAIC outperforms baseline benchmarking.

LG Feb 5, 2024
Path Signatures and Graph Neural Networks for Slow Earthquake Analysis: Better Together?

Hans Riess, Manolis Veveakis, Michael M. Zavlanos

The path signature, having enjoyed recent success in the machine learning community, is a theoretically driven method for engineering features from irregular paths. On the other hand, graph neural networks (GNN), neural architectures for processing data on graphs, excel on tasks with irregular domains, such as sensor networks. In this paper, we introduce a novel approach, Path Signature Graph Convolutional Neural Networks (PS-GCNN), integrating path signatures into graph convolutional neural networks (GCNN), and leveraging the strengths of both path signatures, for feature extraction, and GCNNs, for handling spatial interactions. We apply our method to analyze slow earthquake sequences, also called slow slip events (SSE), utilizing data from GPS timeseries, with a case study on a GPS sensor network on the east coast of New Zealand's North Island. We also establish benchmarks for our method on simulated stochastic differential equations, which model similar reaction-diffusion phenomena. Our methodology shows promise for future advancement in earthquake prediction and sensor network analysis.

LG Sep 10, 2025
Group Distributionally Robust Machine Learning under Group Level Distributional Uncertainty

Xenia Konti, Yi Shen, Zifan Wang et al.

The performance of machine learning (ML) models critically depends on the quality and representativeness of the training data. In applications with multiple heterogeneous data generating sources, standard ML methods often learn spurious correlations that perform well on average but degrade performance for atypical or underrepresented groups. Prior work addresses this issue by optimizing the worst-group performance. However, these approaches typically assume that the underlying data distributions for each group can be accurately estimated using the training data, a condition that is frequently violated in noisy, non-stationary, and evolving environments. In this work, we propose a novel framework that relies on Wasserstein-based distributionally robust optimization (DRO) to account for the distributional uncertainty within each group, while simultaneously preserving the objective of improving the worst-group performance. We develop a gradient descent-ascent algorithm to solve the proposed DRO problem and provide convergence results. Finally, we validate the effectiveness of our method on real-world data.

LG Aug 20, 2025
Source-Guided Flow Matching

Zifan Wang, Alice Harting, Matthieu Barreau et al.

Guidance of generative models is typically achieved by modifying the probability flow vector field through the addition of a guidance field. In this paper, we instead propose the Source-Guided Flow Matching (SGFM) framework, which modifies the source distribution directly while keeping the pre-trained vector field intact. This reduces the guidance problem to a well-defined problem of sampling from the source distribution. We theoretically show that SGFM recovers the desired target distribution exactly. Furthermore, we provide bounds on the Wasserstein error for the generated distribution when using an approximate sampler of the source distribution and an approximate vector field. The key benefit of our approach is that it allows the user to flexibly choose the sampling method depending on their specific problem. To illustrate this, we systematically compare different sampling methods and discuss conditions for asymptotically exact guidance. Moreover, our framework integrates well with optimal flow matching models since the straight transport map generated by the vector field is preserved. Experimental results on synthetic 2D benchmarks, physics-informed generative tasks, and imaging inverse problems demonstrate the effectiveness and flexibility of the proposed framework.

RO Apr 21, 2025
LAPP: Large Language Model Feedback for Preference-Driven Reinforcement Learning

Pingcheng Jian, Xiao Wei, Yanbaihui Liu et al.

We introduce Large Language Model-Assisted Preference Prediction (LAPP), a novel framework for robot learning that enables efficient, customizable, and expressive behavior acquisition with minimum human effort. Unlike prior approaches that rely heavily on reward engineering, human demonstrations, motion capture, or expensive pairwise preference labels, LAPP leverages large language models (LLMs) to automatically generate preference labels from raw state-action trajectories collected during reinforcement learning (RL). These labels are used to train an online preference predictor, which in turn guides the policy optimization process toward satisfying high-level behavioral specifications provided by humans. Our key technical contribution is the integration of LLMs into the RL feedback loop through trajectory-level preference prediction, enabling robots to acquire complex skills including subtle control over gait patterns and rhythmic timing. We evaluate LAPP on a diverse set of quadruped locomotion and dexterous manipulation tasks and show that it achieves efficient learning, higher final performance, faster adaptation, and precise control of high-level behaviors. Notably, LAPP enables robots to master highly dynamic and expressive tasks such as quadruped backflips, which remain out of reach for standard LLM-generated or handcrafted rewards. Our results highlight LAPP as a promising direction for scalable preference-driven robot learning.

LG Mar 12, 2025
Distributionally Robust Multi-Agent Reinforcement Learning for Dynamic Chute Mapping

Guangyi Liu, Suzan Iloglu, Michael Caldara et al.

In Amazon robotic warehouses, the destination-to-chute mapping problem is crucial for efficient package sorting. Often, however, this problem is complicated by uncertain and dynamic package induction rates, which can lead to increased package recirculation. To tackle this challenge, we introduce a Distributionally Robust Multi-Agent Reinforcement Learning (DRMARL) framework that learns a destination-to-chute mapping policy that is resilient to adversarial variations in induction rates. Specifically, DRMARL relies on group distributionally robust optimization (DRO) to learn a policy that performs well not only on average but also on each individual subpopulation of induction rates within the group that capture, for example, different seasonality or operation modes of the system. This approach is then combined with a novel contextual bandit-based predictor of the worst-case induction distribution for each state-action pair, significantly reducing the cost of exploration and thereby increasing the learning efficiency and scalability of our framework. Extensive simulations demonstrate that DRMARL achieves robust chute mapping in the presence of varying induction distributions, reducing package recirculation by an average of 80% in the simulation scenario.

LG Oct 18, 2024
Inverse Reinforcement Learning from Non-Stationary Learning Agents

Kavinayan P. Sivakumar, Yi Shen, Zachary Bell et al.

In this paper, we study an inverse reinforcement learning problem that involves learning the reward function of a learning agent using trajectory data collected while this agent is learning its optimal policy. To address this problem, we propose an inverse reinforcement learning method that allows us to estimate the policy parameters of the learning agent which can then be used to estimate its reward function. Our method relies on a new variant of the behavior cloning algorithm, which we call bundle behavior cloning, and uses a small number of trajectories generated by the learning agent's policy at different points in time to learn a set of policies that match the distribution of actions observed in the sampled trajectories. We then use the cloned policies to train a neural network model that estimates the reward function of the learning agent. We provide a theoretical analysis to show a complexity result on bound guarantees for our method that beats standard behavior cloning as well as numerical experiments for a reinforcement learning problem that validate the proposed method.

SY Apr 3, 2024
Risk-averse Learning with Non-Stationary Distributions

Siyi Wang, Zifan Wang, Xinlei Yi et al.

Considering non-stationary environments in online optimization enables a decision-maker to effectively adapt to changes and improve its performance over time. In such cases, it is favorable to adopt a strategy that minimizes the negative impact of change to avoid potentially risky situations. In this paper, we investigate risk-averse online optimization where the distribution of the random cost changes over time. We minimize a risk-averse objective function using the Conditional Value at Risk (CVaR) as the risk measure. Due to the difficulty in obtaining the exact CVaR gradient, we employ a zeroth-order optimization approach that queries the cost function values multiple times at each iteration and estimates the CVaR gradient using the sampled values. To facilitate the regret analysis, we use a variation metric based on the Wasserstein distance to capture time-varying distributions. Given that the distribution variation is sub-linear in the total number of episodes, we show that our designed learning algorithm achieves sub-linear dynamic regret with high probability for both convex and strongly convex functions. Moreover, theoretical results suggest that increasing the number of samples leads to a reduction in the dynamic regret bounds until the sampling number reaches a specific limit. Finally, we provide numerical experiments of dynamic pricing in a parking lot to illustrate the efficacy of the designed algorithm.

OC Nov 18, 2025
Wasserstein Distributionally Robust Nash Equilibrium Seeking with Heterogeneous Data: A Lagrangian Approach

Zifan Wang, Georgios Pantazis, Sergio Grammatico et al.

We study a class of distributionally robust games where agents are allowed to heterogeneously choose their risk aversion with respect to distributional shifts of the uncertainty. In our formulation, heterogeneous Wasserstein ball constraints on each distribution are enforced through a penalty function leveraging a Lagrangian formulation. We then formulate the distributionally robust Nash equilibrium problem and show that under certain assumptions it is equivalent to a finite-dimensional variational inequality problem with a strongly monotone mapping. We then design an approximate Nash equilibrium seeking algorithm and prove convergence of the average regret to a quantity that diminishes with the number of iterations, thus learning the desired equilibrium up to an a priori specified accuracy. Numerical simulations corroborate our theoretical findings.

LG Sep 29, 2025
Distributionally Robust Federated Learning with Outlier Resilience

Zifan Wang, Xinlei Yi, Xenia Konti et al.

Federated learning (FL) enables collaborative model training without direct data sharing, but its performance can degrade significantly in the presence of data distribution perturbations. Distributionally robust optimization (DRO) provides a principled framework for handling this by optimizing performance against the worst-case distributions within a prescribed ambiguity set. However, existing DRO-based FL methods often overlook the detrimental impact of outliers in local datasets, which can disproportionately bias the learned models. In this work, we study distributionally robust federated learning with explicit outlier resilience. We introduce a novel ambiguity set based on the unbalanced Wasserstein distance, which jointly captures geometric distributional shifts and incorporates a non-geometric Kullback-Leibler penalization to mitigate the influence of outliers. This formulation naturally leads to a challenging min-max-max optimization problem. To enable decentralized training, we reformulate the problem as a tractable Lagrangian penalty optimization, which admits robustness certificates. Building on this reformulation, we propose the distributionally outlier-robust federated learning algorithm and establish its convergence guarantees. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of our approach.

LG Sep 25, 2025
Federated Flow Matching

Zifan Wang, Anqi Dong, Mahmoud Selim et al.

Data today is decentralized, generated and stored across devices and institutions where privacy, ownership, and regulation prevent centralization. This motivates the need to train generative models directly from distributed data locally without central aggregation. In this paper, we introduce Federated Flow Matching (FFM), a framework for training flow matching models under privacy constraints. Specifically, we first examine FFM-vanilla, where each client trains locally with independent source and target couplings, preserving privacy but yielding curved flows that slow inference. We then develop FFM-LOT, which employs local optimal transport couplings to improve straightness within each client but lacks global consistency under heterogeneous data. Finally, we propose FFM-GOT, a federated strategy based on the semi-dual formulation of optimal transport, where a shared global potential function coordinates couplings across clients. Experiments on synthetic and image datasets show that FFM enables privacy-preserving training while enhancing both the flow straightness and sample quality in federated settings, with performance comparable to the centralized baseline.

LG Aug 5, 2025
FairPOT: Balancing AUC Performance and Fairness with Proportional Optimal Transport

Pengxi Liu, Yi Shen, Matthew M. Engelhard et al.

Fairness metrics utilizing the area under the receiver operator characteristic curve (AUC) have gained increasing attention in high-stakes domains such as healthcare, finance, and criminal justice. In these domains, fairness is often evaluated over risk scores rather than binary outcomes, and a common challenge is that enforcing strict fairness can significantly degrade AUC performance. To address this challenge, we propose Fair Proportional Optimal Transport (FairPOT), a novel, model-agnostic post-processing framework that strategically aligns risk score distributions across different groups using optimal transport, but does so selectively by transforming a controllable proportion, i.e., the top-lambda quantile, of scores within the disadvantaged group. By varying lambda, our method allows for a tunable trade-off between reducing AUC disparities and maintaining overall AUC performance. Furthermore, we extend FairPOT to the partial AUC setting, enabling fairness interventions to concentrate on the highest-risk regions. Extensive experiments on synthetic, public, and clinical datasets show that FairPOT consistently outperforms existing post-processing techniques in both global and partial AUC scenarios, often achieving improved fairness with slight AUC degradation or even positive gains in utility. The computational efficiency and practical adaptability of FairPOT make it a promising solution for real-world deployment.

LG May 8, 2025
Enhancing Cooperative Multi-Agent Reinforcement Learning with State Modelling and Adversarial Exploration

Andreas Kontogiannis, Konstantinos Papathanasiou, Yi Shen et al.

Learning to cooperate in distributed partially observable environments with no communication abilities poses significant challenges for multi-agent deep reinforcement learning (MARL). This paper addresses key concerns in this domain, focusing on inferring state representations from individual agent observations and leveraging these representations to enhance agents' exploration and collaborative task execution policies. To this end, we propose a novel state modelling framework for cooperative MARL, where agents infer meaningful belief representations of the non-observable state, with respect to optimizing their own policies, while filtering redundant and less informative joint state information. Building upon this framework, we propose the MARL SMPE algorithm. In SMPE, agents enhance their own policy's discriminative abilities under partial observability, explicitly by incorporating their beliefs into the policy network, and implicitly by adopting an adversarial type of exploration policies which encourages agents to discover novel, high-value states while improving the discriminative abilities of others. Experimentally, we show that SMPE outperforms state-of-the-art MARL algorithms in complex fully cooperative tasks from the MPE, LBF, and RWARE benchmarks.

LG Oct 18, 2024
Transfer Reinforcement Learning in Heterogeneous Action Spaces using Subgoal Mapping

Kavinayan P. Sivakumar, Yan Zhang, Zachary Bell et al.

In this paper, we consider a transfer reinforcement learning problem involving agents with different action spaces. Specifically, for any new unseen task, the goal is to use a successful demonstration of this task by an expert agent in its action space to enable a learner agent to learn an optimal policy in its own different action space with fewer samples than those required if the learner were learning on its own. Existing transfer learning methods across different action spaces either require handcrafted mappings between those action spaces provided by human experts, which can induce bias in the learning procedure, or require the expert agent to share its policy parameters with the learner agent, which does not generalize well to unseen tasks. In this work, we propose a method that learns a subgoal mapping between the expert agent policy and the learner agent policy. Since the expert agent and the learner agent have different action spaces, their optimal policies can have different subgoal trajectories. We learn this subgoal mapping by training a Long Short Term Memory (LSTM) network for a distribution of tasks and then use this mapping to predict the learner subgoal sequence for unseen tasks, thereby improving the speed of learning by biasing the agent's policy towards the predicted learner subgoal sequence. Through numerical experiments, we demonstrate that the proposed learning scheme can effectively find the subgoal mapping underlying the given distribution of tasks. Moreover, letting the learner agent imitate the expert agent's policy with the learnt subgoal mapping can significantly improve the sample efficiency and training time of the learner agent in unseen new tasks.

SY Jun 22, 2021
Failing with Grace: Learning Neural Network Controllers that are Boundedly Unsafe

Panagiotis Vlantis, Leila J. Bridgeman, Michael M. Zavlanos

In this work, we consider the problem of learning a feed-forward neural network controller to safely steer an arbitrarily shaped planar robot in a compact and obstacle-occluded workspace. Unlike existing methods that depend strongly on the density of data points close to the boundary of the safe state space to train neural network controllers with closed-loop safety guarantees, here we propose an alternative approach that lifts such strong assumptions on the data that are hard to satisfy in practice and instead allows for graceful safety violations, i.e., of a bounded magnitude that can be spatially controlled. To do so, we employ reachability analysis techniques to encapsulate safety constraints in the training process. Specifically, to obtain a computationally efficient over-approximation of the forward reachable set of the closed-loop system, we partition the robot's state space into cells and adaptively subdivide the cells that contain states which may escape the safe set under the trained control law. Then, using the overlap between each cell's forward reachable set and the set of infeasible robot configurations as a measure for safety violations, we introduce appropriate terms into the loss function that penalize this overlap in the training process. As a result, our method can learn a safe vector field for the closed-loop system and, at the same time, provide worst-case bounds on safety violation over the whole configuration space, defined by the overlap between the over-approximation of the forward reachable set of the closed-loop system and the set of unsafe states. Moreover, it can control the tradeoff between computational complexity and tightness of these bounds. Our proposed method is supported by both theoretical results and simulation studies.

LG Jun 7, 2021
Learning without Knowing: Unobserved Context in Continuous Transfer Reinforcement Learning

Chenyu Liu, Yan Zhang, Yi Shen et al.

In this paper, we consider a transfer Reinforcement Learning (RL) problem in continuous state and action spaces, under unobserved contextual information. For example, the context can represent the mental view of the world that an expert agent has formed through past interactions with this world. We assume that this context is not accessible to a learner agent who can only observe the expert data. Then, our goal is to use the context-aware expert data to learn an optimal context-unaware policy for the learner using only a few new data samples. Such problems are typically solved using imitation learning that assumes that both the expert and learner agents have access to the same information. However, if the learner does not know the expert context, using the expert data alone will result in a biased learner policy and will require many new data samples to improve. To address this challenge, in this paper, we formulate the learning problem as a causal bound-constrained Multi-Armed-Bandit (MAB) problem. The arms of this MAB correspond to a set of basis policy functions that can be initialized in an unsupervised way using the expert data and represent the different expert behaviors affected by the unobserved context. On the other hand, the MAB constraints correspond to causal bounds on the accumulated rewards of these basis policy functions that we also compute from the expert data. The solution to this MAB allows the learner agent to select the best basis policy and improve it online. And the use of causal bounds reduces the exploration variance and, therefore, improves the learning rate. We provide numerical experiments on an autonomous driving example that show that our proposed transfer RL method improves the learner's policy faster compared to existing imitation learning methods and enjoys much lower variance during training.

SYMar 8, 2021
Formal Verification of Stochastic Systems with ReLU Neural Network Controllers

Shiqi Sun, Yan Zhang, Xusheng Luo et al.

In this work, we address the problem of formal safety verification for stochastic cyber-physical systems (CPS) equipped with ReLU neural network (NN) controllers. Our goal is to find the set of initial states from where, with a predetermined confidence, the system will not reach an unsafe configuration within a specified time horizon. Specifically, we consider discrete-time LTI systems with Gaussian noise, which we abstract by a suitable graph. Then, we formulate a Satisfiability Modulo Convex (SMC) problem to estimate upper bounds on the transition probabilities between nodes in the graph. Using this abstraction, we propose a method to compute tight bounds on the safety probabilities of nodes in this graph, despite possible over-approximations of the transition probabilities between these nodes. Additionally, using the proposed SMC formula, we devise a heuristic method to refine the abstraction of the system in order to further improve the estimated safety bounds. Finally, we corroborate the efficacy of the proposed method with simulation results considering a robot navigation example and comparison against a state-of-the-art verification scheme.

AIFeb 8, 2021
Learning Optimal Strategies for Temporal Tasks in Stochastic Games

Alper Kamil Bozkurt, Yu Wang, Michael M. Zavlanos et al.

Synthesis from linear temporal logic (LTL) specifications provides assured controllers for systems operating in stochastic and potentially adversarial environments. Automatic synthesis tools, however, require a model of the environment to construct controllers. In this work, we introduce a model-free reinforcement learning (RL) approach to derive controllers from given LTL specifications even when the environment is completely unknown. We model the problem as a stochastic game (SG) between the controller and the adversarial environment; we then learn optimal control strategies that maximize the probability of satisfying the LTL specifications against the worst-case environment behavior. We first construct a product game using the deterministic parity automaton (DPA) translated from the given LTL specification. By deriving distinct rewards and discount factors from the acceptance condition of the DPA, we reduce the maximization of the worst-case probability of satisfying the LTL specification into the maximization of a discounted reward objective in the product game; this enables the use of model-free RL algorithms to learn an optimal controller strategy. To deal with the common scalability problems when the number of sets defining the acceptance condition of the DPA (usually referred to as colors) is large, we propose a lazy color generation method where distinct rewards and discount factors are utilized only when needed, and an approximate method where the controller eventually focuses on only one color. In several case studies, we show that our approach is scalable to a wide range of LTL formulas, significantly outperforming existing methods for learning controllers from LTL specifications in SGs.

ROJan 14, 2021
Temporal Logic Task Allocation in Heterogeneous Multi-Robot Systems

Xusheng Luo, Michael M. Zavlanos

In this paper, we consider the problem of optimally allocating tasks, expressed as global Linear Temporal Logic (LTL) specifications, to teams of heterogeneous mobile robots. The robots are classified into different types that capture their different capabilities, and each task may require robots of multiple types. The specific robots assigned to each task are immaterial, as long as they are of the desired type. Given a discrete workspace, our goal is to design paths, i.e., sequences of discrete states, for the robots so that the LTL specification is satisfied. To obtain a scalable solution to this complex temporal logic task allocation problem, we propose a hierarchical approach that first allocates specific robots to tasks using the information about the tasks contained in the Nondeterministic Büchi Automaton (NBA) that captures the LTL specification, and then designs low-level executable plans for the robots that respect the high-level assignment. Specifically, we first prune and relax the NBA by removing all negative atomic propositions. This step is motivated by "lazy collision checking" methods in robotics and allows us to simplify the planning problem by checking constraint satisfaction only when needed. Then, we extract sequences of subtasks from the relaxed NBA along with their temporal orders, and formulate a Mixed Integer Linear Program (MILP) to allocate these subtasks to the robots. Finally, we define generalized multi-robot path planning problems to obtain low-level executable robot plans that satisfy both the high-level task allocation and the temporal constraints captured by the negative atomic propositions in the original NBA. We show that our method is complete for a subclass of LTL that covers a broad range of tasks and present numerical simulations demonstrating that it can generate paths with lower cost, considerably faster than existing methods.

MED-PHDec 8, 2020
Plane Wave Elastography: A Frequency-Domain Ultrasound Shear Wave Elastography Approach

Reza Khodayi-mehr, Matthew W. Urban, Michael M. Zavlanos et al.

In this paper, we propose Plane Wave Elastography (PWE), a novel ultrasound shear wave elastography (SWE) approach. Currently, commercial methods for SWE rely on directional filtering, based on prior knowledge of the wave propagation direction, to remove complicated wave patterns formed due to reflection and refraction. The result is a set of decomposed directional waves that are separately analyzed to construct shear modulus fields that are then combined through compounding. Instead, PWE relies on a rigorous representation of the wave propagation using the frequency-domain scalar wave equation to automatically select appropriate propagation directions and simultaneously reconstruct shear modulus fields. Specifically, assuming a homogeneous, isotropic, incompressible, linear-elastic medium, we represent the solution of the wave equation using a linear combination of plane waves propagating in arbitrary directions. Given this closed-form solution, we formulate the SWE problem as a nonlinear least-squares optimization problem which can be solved very efficiently. Through numerous phantom studies, we show that PWE can handle complicated waveforms without prior filtering and is competitive with state-of-the-art methods that require prior filtering based on the knowledge of propagation directions.

RONov 3, 2020
Human-in-the-Loop Robot Planning with Non-Contextual Bandit Feedback

Yijie Zhou, Yan Zhang, Xusheng Luo et al.

In this paper, we consider a robot navigation problem in environments populated by humans. The goal is to determine collision-free and dynamically feasible trajectories that also maximize human satisfaction. Such trajectories may drive the robot close to humans that need help with their work, or keep the robot away from humans when it can interfere with human sight or work. In practice, human satisfaction is subjective and hard to describe mathematically. As a result, the planning problem we consider in this paper may lack important contextual information. To address this challenge, we propose a semi-supervised Bayesian Optimization (BO) method to design globally optimal robot trajectories using non-contextual bandit human feedback in the form of complaints or satisfaction ratings that express how satisfactory a trajectory is, without revealing the reason. Since trajectory planning is typically a high-dimensional optimization problem in the space of waypoints that define a trajectory, BO may require prohibitively many queries for human feedback to return a good solution. To this end, we use an autoencoder to reduce the high-dimensional problem space into a low-dimensional latent space, which we update using human feedback. Moreover, we improve the exploration efficiency of BO by biasing the search for new trajectories towards dynamically feasible and collision-free trajectories obtained using off-the-shelf motion planners. We demonstrate the efficiency of our proposed trajectory planning method in a scenario with humans that have diversified and unknown demands.

LGOct 14, 2020
Boosting One-Point Derivative-Free Online Optimization via Residual Feedback

Yan Zhang, Yi Zhou, Kaiyi Ji et al.

Zeroth-order optimization (ZO) typically relies on two-point feedback to estimate the unknown gradient of the objective function. Nevertheless, two-point feedback cannot be used for online optimization of time-varying objective functions, where only a single query of the function value is possible at each time step. In this work, we propose a new one-point feedback method for online optimization that estimates the objective function gradient using the residual between two feedback points at consecutive time instants. Moreover, we develop regret bounds for ZO with residual feedback for both convex and nonconvex online optimization problems. Specifically, for both deterministic and stochastic problems and for both Lipschitz and smooth objective functions, we show that using residual feedback can produce gradient estimates with much smaller variance compared to conventional one-point feedback methods. As a result, our regret bounds are much tighter compared to existing regret bounds for ZO with conventional one-point feedback, which suggests that ZO with residual feedback can better track the optimizer of online optimization problems. Additionally, our regret bounds rely on weaker assumptions than those used in conventional one-point feedback methods. Numerical experiments show that ZO with residual feedback also significantly outperforms existing one-point feedback methods in practice.

LGJun 18, 2020
Cooperative Multi-Agent Reinforcement Learning with Partial Observations

Yan Zhang, Michael M. Zavlanos

In this paper, we propose a distributed zeroth-order policy optimization method for Multi-Agent Reinforcement Learning (MARL). Existing MARL algorithms often assume that every agent can observe the states and actions of all the other agents in the network. This can be impractical in large-scale problems, where sharing the state and action information with multi-hop neighbors may incur significant communication overhead. The advantage of the proposed zeroth-order policy optimization method is that it allows the agents to compute the local policy gradients needed to update their local policy functions using local estimates of the global accumulated rewards that depend on partial state and action information only and can be obtained using consensus. Specifically, to calculate the local policy gradients, we develop a new distributed zeroth-order policy gradient estimator that relies on one-point residual-feedback which, compared to existing zeroth-order estimators that also rely on one-point feedback, significantly reduces the variance of the policy gradient estimates improving, in this way, the learning performance. We show that the proposed distributed zeroth-order policy optimization method with constant stepsize converges to the neighborhood of a policy that is a stationary point of the global objective function. The size of this neighborhood depends on the agents' learning rates, the exploration parameters, and the number of consensus steps used to calculate the local estimates of the global accumulated rewards. Moreover, we provide numerical experiments that demonstrate that our new zeroth-order policy gradient estimator is more sample-efficient compared to other existing one-point estimators.

OCJun 18, 2020
A New One-Point Residual-Feedback Oracle For Black-Box Learning and Control

Yan Zhang, Yi Zhou, Kaiyi Ji et al.

Zeroth-order optimization (ZO) algorithms have been recently used to solve black-box or simulation-based learning and control problems, where the gradient of the objective function cannot be easily computed but can be approximated using the objective function values. Many existing ZO algorithms adopt two-point feedback schemes due to their fast convergence rate compared to one-point feedback schemes. However, two-point schemes require two evaluations of the objective function at each iteration, which can be impractical in applications where the data are not all available a priori, e.g., in online optimization. In this paper, we propose a novel one-point feedback scheme that queries the function value once at each iteration and estimates the gradient using the residual between two consecutive points. When optimizing a deterministic Lipschitz function, we show that the query complexity of ZO with the proposed one-point residual feedback matches that of ZO with the existing two-point schemes. Moreover, the query complexity of the proposed algorithm can be improved when the objective function has a Lipschitz gradient. Then, for stochastic bandit optimization problems where only noisy objective function values are given, we show that ZO with one-point residual feedback achieves the same convergence rate as that of the two-point scheme with uncontrollable data samples. We demonstrate the effectiveness of the proposed one-point residual feedback via extensive numerical experiments.
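
The one-point residual-feedback idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `residual_feedback_descent`, the Gaussian perturbations, the quadratic test objective, and all parameter values are assumptions chosen for illustration. Each iteration queries the objective once and reuses the previous query to form the gradient estimate.

```python
import numpy as np

def residual_feedback_descent(f, x0, steps=2000, delta=0.05, lr=0.002, seed=0):
    """Zeroth-order descent with a one-point residual-feedback estimate.

    Illustrative sketch only: Gaussian perturbations and these step sizes
    are assumptions, not settings taken from the paper.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    # One initial query so that the first iteration has a previous value.
    u = rng.standard_normal(x.shape)
    prev_value = f(x + delta * u)
    for _ in range(steps):
        u = rng.standard_normal(x.shape)
        value = f(x + delta * u)  # the single query of this iteration
        # Residual between two consecutive queries, scaled along direction u.
        grad_est = (value - prev_value) / delta * u
        x = x - lr * grad_est
        prev_value = value
    return x

def quadratic(z):
    # Hypothetical black-box objective with minimizer at the origin.
    return float(np.dot(z, z))

x_final = residual_feedback_descent(quadratic, x0=[1.0, -1.0])
print(np.linalg.norm(x_final))
```

Subtracting the previous query value does not bias the estimate, because the fresh perturbation has zero mean, but it shrinks the magnitude of the estimate compared with using the current query value alone.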

LGMar 9, 2020
Transfer Reinforcement Learning under Unobserved Contextual Information

Yan Zhang, Michael M. Zavlanos

In this paper, we study a transfer reinforcement learning problem where the state transitions and rewards are affected by the environmental context. Specifically, we consider a demonstrator agent that has access to a context-aware policy and can generate transition and reward data based on that policy. These data constitute the experience of the demonstrator. Then, the goal is to transfer this experience, excluding the underlying contextual information, to a learner agent that does not have access to the environmental context, so that they can learn a control policy using fewer samples. It is well known that disregarding the causal effect of the contextual information can introduce bias in the transition and reward models estimated by the learner, resulting in a learned suboptimal policy. To address this challenge, in this paper, we develop a method to obtain causal bounds on the transition and reward functions using the demonstrator's data, which we then use to obtain causal bounds on the value functions. Using these value function bounds, we propose new Q learning and UCB-Q learning algorithms that converge to the true value function without bias. We provide numerical experiments for robot motion planning problems that validate the proposed value function bounds and demonstrate that the proposed algorithms can effectively make use of the data from the demonstrator to accelerate the learning process of the learner.

ROMar 2, 2020
Socially-Aware Robot Planning via Bandit Human Feedback

Xusheng Luo, Yan Zhang, Michael M. Zavlanos

In this paper, we consider the problem of designing collision-free, dynamically feasible, and socially-aware trajectories for robots operating in environments populated by humans. We define trajectories to be socially-aware if they do not interfere with humans in any way that causes discomfort. In this paper, discomfort is defined broadly and, depending on specific individuals, it can result from the robot being too close to a human or from interfering with human sight or tasks. Moreover, we assume that human feedback is a bandit feedback indicating a complaint or no complaint on the part of the robot trajectory that interferes with the humans, and it does not reveal any contextual information about the locations of the humans or the reason for a complaint. Finally, we assume that humans can move in the obstacle-free space and, as a result, human utility can change. We formulate this planning problem as an online optimization problem that minimizes the social value of the time-varying robot trajectory, defined by the total number of incurred human complaints. As the human utility is unknown, we employ zeroth-order, or derivative-free, optimization methods to solve this problem, which we combine with off-the-shelf motion planners to satisfy the dynamic feasibility and collision-free specifications of the resulting trajectories. To the best of our knowledge, this is a new framework for socially-aware robot planning that is not restricted to avoiding collisions with humans but, instead, focuses on increasing the social value of the robot trajectories using only bandit human feedback.

LGDec 16, 2019
VarNet: Variational Neural Networks for the Solution of Partial Differential Equations

Reza Khodayi-Mehr, Michael M. Zavlanos

In this paper we propose a new model-based unsupervised learning method, called VarNet, for the solution of partial differential equations (PDEs) using deep neural networks (NNs). Particularly, we propose a novel loss function that relies on the variational (integral) form of PDEs as opposed to their differential form which is commonly used in the literature. Our loss function is discretization-free, highly parallelizable, and more effective in capturing the solution of PDEs since it employs lower-order derivatives and trains over measure non-zero regions of space-time. Given this loss function, we also propose an approach to optimally select the space-time samples, used to train the NN, that is based on the feedback provided from the PDE residual. The models obtained using VarNet are smooth and do not require interpolation. They are also easily differentiable and can directly be used for control and optimization of PDEs. Finally, VarNet can straightforwardly incorporate parametric PDE models, making it a natural tool for model order reduction (MOR) of PDEs. We demonstrate the performance of our method through extensive numerical experiments for the advection-diffusion PDE as an important case-study.

OCNov 12, 2019
A Distributed Online Convex Optimization Algorithm with Improved Dynamic Regret

Yan Zhang, Robert J. Ravier, Michael M. Zavlanos et al.

In this paper, we consider the problem of distributed online convex optimization, where a network of local agents aims to jointly optimize a convex function over a period of multiple time steps. The agents do not have any information about the future. Existing algorithms have established dynamic regret bounds that have explicit dependence on the number of time steps. In this work, we show that we can remove this dependence assuming that the local objective functions are strongly convex. More precisely, we propose a gradient tracking algorithm where agents jointly communicate and descend based on corrected gradient steps. We verify our theoretical results through numerical experiments.
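
For readers unfamiliar with gradient tracking, the following sketch shows the standard textbook scheme on a fixed (not time-varying) objective. It is a simplification and not the online algorithm of the paper: the function name `gradient_tracking`, the ring mixing matrix `W`, the local quadratic costs, and the step size are all assumptions made for illustration. Each agent mixes its iterate with its neighbors and descends along a tracker `y` that is corrected by the change in its local gradient.

```python
import numpy as np

def gradient_tracking(W, b, alpha=0.1, iters=300):
    """Standard gradient tracking on a static problem (illustrative only).

    Agent i holds the local cost 0.5 * (x - b[i])**2 over a scalar x, so the
    minimizer of the sum of local costs is mean(b). W is assumed to be a
    doubly stochastic mixing matrix over the communication graph.
    """
    b = np.asarray(b, dtype=float)

    def local_grads(x):
        # Stacked local gradients, one entry per agent.
        return x - b

    x = np.zeros_like(b)   # one scalar iterate per agent
    y = local_grads(x)     # tracker initialised with the local gradients
    for _ in range(iters):
        x_next = W @ x - alpha * y
        # Correct the tracker with the change in local gradients.
        y = W @ y + local_grads(x_next) - local_grads(x)
        x = x_next
    return x

# Hypothetical 4-agent ring: weight 0.5 on self, 0.25 on each neighbor.
W = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])
b = [1.0, 2.0, 3.0, 6.0]
print(gradient_tracking(W, b))
```

Because `W` is doubly stochastic, the average of the trackers always equals the average of the current local gradients, which is what lets every agent descend along an estimate of the global gradient using only neighbor communication.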

ROSep 16, 2019
Control Synthesis from Linear Temporal Logic Specifications using Model-Free Reinforcement Learning

Alper Kamil Bozkurt, Yu Wang, Michael M. Zavlanos et al.

We present a reinforcement learning (RL) framework to synthesize a control policy from a given linear temporal logic (LTL) specification in an unknown stochastic environment that can be modeled as a Markov Decision Process (MDP). Specifically, we learn a policy that maximizes the probability of satisfying the LTL formula without learning the transition probabilities. We introduce a novel rewarding and path-dependent discounting mechanism based on the LTL formula such that (i) an optimal policy maximizing the total discounted reward effectively maximizes the probabilities of satisfying LTL objectives, and (ii) a model-free RL algorithm using these rewards and discount factors is guaranteed to converge to such a policy. Finally, we illustrate the applicability of our RL-based synthesis approach on two motion planning case studies.

ROSep 2, 2019
An Abstraction-Free Method for Multi-Robot Temporal Logic Optimal Control Synthesis

Xusheng Luo, Yiannis Kantaros, Michael M. Zavlanos

The majority of existing Linear Temporal Logic (LTL) planning methods rely on the construction of a discrete product automaton that combines a discrete abstraction of robot mobility and a Büchi automaton that captures the LTL specification. Representing this product automaton as a graph and using graph search techniques, optimal plans that satisfy the LTL task can be synthesized. However, constructing expressive discrete abstractions makes the synthesis problem computationally intractable. In this paper, we propose a new sampling-based LTL planning algorithm that does not require any discrete abstraction of robot mobility. Instead, it incrementally builds trees that explore the product state-space, until a maximum number of iterations is reached or a feasible plan is found. The use of trees makes data storage and graph search tractable, which significantly increases the scalability of our algorithm. To accelerate the construction of feasible plans, we introduce bias in the sampling process which is guided by transitions in the Büchi automaton that belong to the shortest path to the accepting states. We show that our planning algorithm, with and without bias, is probabilistically complete and asymptotically optimal. Finally, we present numerical experiments showing that our method outperforms relevant temporal logic planning methods.

OCMar 25, 2019
An Optimal Graph-Search Method for Secure State Estimation

Xusheng Luo, Miroslav Pajic, Michael M. Zavlanos

The growing complexity of modern Cyber-Physical Systems (CPS) and the frequent communication between their components make them vulnerable to malicious attacks. As a result, secure state estimation is a critical requirement for the control of these systems. Many existing secure state estimation methods suffer from combinatorial complexity which grows with the number of states and sensors in the system. This complexity can be mitigated using optimization-based methods that relax the original state estimation problem, although at the cost of optimality as these methods often identify attack-free sensors as attacked. In this paper, we propose a new optimal graph-search algorithm to correctly identify malicious attacks and to securely estimate the states even in large-scale CPS modeled as linear time-invariant systems. The graph consists of layers, each one containing two nodes capturing a truth assignment of any given sensor, and directed edges connecting adjacent layers only. Then, our algorithm searches the layers of this graph incrementally, favoring directions at higher layers with more attack-free assignments, while actively managing a repository of nodes to be expanded at later iterations. The proposed search bias and the ability to revisit nodes in the repository and self-correct, allow our graph-search algorithm to reach the optimal assignment faster and tackle larger problems. We show that our algorithm is complete and optimal provided that process and measurement noises do not dominate the attack signal. Moreover, we provide numerical simulations that demonstrate the ability of our algorithm to correctly identify attacked sensors and securely reconstruct the state. Our simulations show that our method outperforms existing algorithms both in terms of optimality and execution time.

LGMar 21, 2019
Distributed off-Policy Actor-Critic Reinforcement Learning with Policy Consensus

Yan Zhang, Michael M. Zavlanos

In this paper, we propose a distributed off-policy actor-critic method to solve multi-agent reinforcement learning problems. Specifically, we assume that all agents keep local estimates of the global optimal policy parameter and update their local value function estimates independently. Then, we introduce an additional consensus step to let all the agents asymptotically achieve agreement on the global optimal policy function. The convergence analysis of the proposed algorithm is provided and the effectiveness of the proposed algorithm is validated using a distributed resource allocation example. Compared to relevant distributed actor-critic methods, here the agents do not share information about their local tasks, but instead they coordinate to estimate the global policy function.

RODec 11, 2018
Deep Learning for Robotic Mass Transport Cloaking

Reza Khodayi-mehr, Michael M. Zavlanos

We consider the problem of mass transport cloaking using mobile robots. The robots move along a predefined curve that encloses a safe zone and carry sources that collectively counteract a chemical agent released in the environment. The goal is to steer the mass flux around a desired region so that it remains unaffected by the external concentration. We formulate the problem of controlling the robot positions and release rates as a PDE-constrained optimization, where the propagation of the chemical is modeled by the advection-diffusion (AD) PDE. We use a neural network (NN) to approximate the solution of the PDE. Particularly, we propose a novel loss function for the NN that utilizes the variational form of the AD-PDE and allows us to reformulate the planning problem as an unsupervised model-based learning problem. Our loss function is discretization-free and highly parallelizable. Unlike passive cloaking methods that use metamaterials to steer the mass flux, our method is the first to use mobile robots to actively control the concentration levels and create safe zones independent of environmental conditions. We demonstrate the performance of our method in simulations.

RODec 10, 2018
Physics-Based Learning for Robotic Environmental Sensing

Reza Khodayi-mehr, Michael M. Zavlanos

We propose a physics-based method to learn environmental fields (EFs) using a mobile robot. Common purely data-driven methods require prohibitively many measurements to accurately learn such complex EFs. Alternatively, physics-based models provide global knowledge of EFs but require experimental validation, depend on uncertain parameters, and are intractable for mobile robots. To address these challenges, we propose a Bayesian framework to select the most likely physics-based models of EFs in real-time, from a pool of numerical solutions generated offline as a function of the uncertain parameters. Specifically, we focus on turbulent flow fields and utilize Gaussian processes (GPs) to construct statistical models for them, using the pool of numerical solutions to inform their prior mean. To incorporate flow measurements into these GPs, we control a custom-built mobile robot through a sequence of waypoints that maximize the information content of the measurements. We experimentally demonstrate that our proposed framework constructs a posterior distribution of the flow field that better approximates the real flow compared to the prior numerical solutions and purely data-driven methods.

ROSep 21, 2018
STyLuS*: A Temporal Logic Optimal Control Synthesis Algorithm for Large-Scale Multi-Robot Systems

Yiannis Kantaros, Michael M. Zavlanos

This paper proposes a new highly scalable and asymptotically optimal control synthesis algorithm from linear temporal logic specifications, called STyLuS* for large-Scale optimal Temporal Logic Synthesis, that is designed to solve complex temporal planning problems in large-scale multi-robot systems. Existing planning approaches with temporal logic specifications rely on graph search techniques applied to a product automaton constructed among the robots. In our previous work, we have proposed a more tractable sampling-based algorithm that incrementally builds trees that approximate the state-space and transitions of the synchronous product automaton and does not require sophisticated graph search techniques. Here, we extend our previous work by introducing bias in the sampling process which is guided by transitions in the Büchi automaton that belong to the shortest path to the accepting states. This allows us to synthesize optimal motion plans from product automata with hundreds of orders of magnitude more states than those that existing optimal control synthesis methods or off-the-shelf model checkers can manipulate. We show that STyLuS* is probabilistically complete and asymptotically optimal and has exponential convergence rate. This is the first time that convergence rate results are provided for sampling-based optimal control synthesis methods. We provide simulation results that show that STyLuS* can synthesize optimal motion plans for very large multi-robot systems which is impossible using state-of-the-art methods.

ROMay 3, 2018
Distributed State Estimation Using Intermittently Connected Robot Networks

Reza Khodayi-mehr, Yiannis Kantaros, Michael M. Zavlanos

This paper considers the problem of distributed state estimation using multi-robot systems. The robots have limited communication capabilities and, therefore, communicate their measurements intermittently only when they are physically close to each other. To decrease the distance that the robots need to travel only to communicate, we divide them into small teams that can communicate at different locations to share information and update their beliefs. Then, we propose a new distributed scheme that combines (i) communication schedules that ensure that the network is intermittently connected, and (ii) sampling-based motion planning for the robots in every team with the objective to collect optimal measurements and decide a location for those robots to communicate. To the best of our knowledge, this is the first distributed state estimation framework that relaxes all network connectivity assumptions, and controls intermittent communication events so that the estimation uncertainty is minimized. We present simulation results that demonstrate significant improvement in estimation accuracy compared to methods that maintain an end-to-end connected network for all time.

ROApr 29, 2018
Control of Magnetic Microrobot Teams for Temporal Micromanipulation Tasks

Yiannis Kantaros, Benjamin Johnson, Sagar Chowdhury et al.

In this paper, we present a control framework that allows magnetic microrobot teams to accomplish complex micromanipulation tasks captured by global Linear Temporal Logic (LTL) formulas. To address this problem, we propose an optimal control synthesis method that constructs discrete plans for the robots that satisfy both the assigned tasks as well as proximity constraints between the robots due to the physics of the problem. Our proposed algorithm relies on an existing optimal control synthesis approach combined with a novel sampling-based technique to reduce the state-space of the product automaton that is associated with the LTL specifications. The synthesized discrete plans are executed by the microrobots independently using local magnetic fields. Simulation studies show that the proposed algorithm can address large-scale planning problems that cannot be solved using existing optimal control synthesis approaches. Moreover, we present experimental results that also illustrate the potential of our method in practice. To the best of our knowledge, this is the first control framework that allows independent control of teams of magnetic microrobots for temporal micromanipulation tasks.

ROJun 16, 2017
Probabilistic Motion Planning under Temporal Tasks and Soft Constraints

Meng Guo, Michael M. Zavlanos

This paper studies motion planning of a mobile robot under uncertainty. The control objective is to synthesize a finite-memory control policy, such that a high-level task specified as a Linear Temporal Logic (LTL) formula is satisfied with a desired high probability. Uncertainty is considered in the workspace properties, robot actions, and task outcomes, giving rise to a Markov Decision Process (MDP) that models the proposed system. Different from most existing methods, we consider cost optimization both in the prefix and suffix of the system trajectory. We also analyze the potential trade-off between reducing the mean total cost and maximizing the probability that the task is satisfied. The proposed solution is based on formulating two coupled Linear Programs, for the prefix and suffix, respectively, and combining them into a multi-objective optimization problem, which provides provable guarantees on the probabilistic satisfiability and the total cost optimality. We show that our method outperforms relevant approaches that employ Round-Robin policies in the trajectory suffix. Furthermore, we propose a new control synthesis algorithm to minimize the frequency of reaching a bad state when the probability of satisfying the tasks is zero, in which case most existing methods return no solution. We validate the above schemes via both numerical simulations and experimental studies.

ROJun 13, 2017
Sampling-Based Optimal Control Synthesis for Multi-Robot Systems under Global Temporal Tasks

Yiannis Kantaros, Michael M. Zavlanos

This paper proposes a new optimal control synthesis algorithm for multi-robot systems under global temporal logic tasks. Existing planning approaches under global temporal goals rely on graph search techniques applied to a product automaton constructed among the robots. In this paper, we propose a new sampling-based algorithm that incrementally builds trees that approximate the state-space and transitions of the synchronous product automaton. By approximating the product automaton by a tree rather than representing it explicitly, we require far fewer memory resources to store it, and motion plans can be found by tracing sequences of parent nodes without the need for sophisticated graph search methods. This significantly increases the scalability of our algorithm compared to existing optimal control synthesis methods. We also show that the proposed algorithm is probabilistically complete and asymptotically optimal. Finally, we present numerical experiments showing that our approach can synthesize optimal plans from product automata with billions of states, which is not possible using standard optimal control synthesis algorithms or off-the-shelf model checkers.
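The idea of recovering a plan by tracing parent nodes can be illustrated with a minimal sketch. This is not the authors' algorithm: the tree, the state labels, and the helper `trace_plan` are hypothetical, and the sampling and rewiring steps that would build the tree are omitted.

```python
from typing import Dict, Hashable, List, Optional


def trace_plan(parent: Dict[Hashable, Optional[Hashable]],
               goal: Hashable) -> List[Hashable]:
    """Return the root-to-goal path by following parent pointers."""
    path: List[Hashable] = []
    node: Optional[Hashable] = goal
    while node is not None:
        path.append(node)
        node = parent[node]  # the root maps to None, which ends the loop
    path.reverse()
    return path


# Hypothetical tree over product states (robot cell, automaton state).
# Each node stores only its parent, so no explicit graph is kept.
parent = {
    ("c0", "q0"): None,
    ("c1", "q0"): ("c0", "q0"),
    ("c2", "q1"): ("c1", "q0"),
    ("c3", "q0"): ("c0", "q0"),
}

plan = trace_plan(parent, ("c2", "q1"))
```

Because every node has a single parent, extracting a plan takes time linear in the plan length and needs no search over the tree.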

ROJun 6, 2017
Controlling a Robotic Stereo Camera Under Image Quantization Noise

Charles Freundlich, Yan Zhang, Alex Zihao Zhu et al.

In this paper, we address the problem of controlling a mobile stereo camera under image quantization noise. Assuming that a pair of images of a set of targets is available, the camera moves through a sequence of Next-Best-Views (NBVs), i.e., a sequence of views that minimize the trace of the targets' cumulative state covariance, constructed using a realistic model of the stereo rig that captures image quantization noise and a Kalman Filter (KF) that fuses the observation history with new information. The proposed algorithm decomposes control into two stages: first the NBV is computed in the camera-relative coordinates, and then the camera moves to realize this view in the fixed global coordinate frame. This decomposition allows the camera to drive to a new pose that effectively realizes the NBV in camera coordinates while satisfying Field-of-View constraints in global coordinates, a task that is particularly challenging using complex sensing models. We provide simulations and real experiments that illustrate the ability of the proposed mobile camera system to accurately localize sets of targets. We also propose a novel data-driven technique to characterize unmodeled uncertainty, such as calibration errors, at the pixel level and show that this method ensures stability of the KF.

ROJun 6, 2017
Distributed Hierarchical Control for State Estimation With Robotic Sensor Networks

Charles Freundlich, Yan Zhang, Michael M. Zavlanos

This paper addresses active state estimation with a team of robotic sensors. The states to be estimated are represented by spatially distributed, uncorrelated, stationary vectors. Given a prior belief on the geographic locations of the states, we cluster the states in moderately sized groups and propose a new hierarchical Dynamic Programming (DP) framework to compute optimal sensing policies for each cluster that mitigates the computational cost of planning optimal policies in the combined belief space. Then, we develop a decentralized assignment algorithm that dynamically allocates clusters to robots based on the pre-computed optimal policies at each cluster. The integrated distributed state estimation framework is optimal at the cluster level but also scales very well to large numbers of states and robot sensors. We demonstrate the efficiency of the proposed method in both simulations and real-world experiments using stereoscopic vision sensors.

ROJun 6, 2017
Model-Based Active Source Identification in Complex Environments

Reza Khodayi-mehr, Wilkins Aquino, Michael M. Zavlanos

We consider the problem of Active Source Identification (ASI) in steady-state Advection-Diffusion (AD) transport systems. Unlike existing bio-inspired heuristic methods, we propose a model-based method that employs the AD-PDE to capture the transport phenomenon. Specifically, we formulate the Source Identification (SI) problem as a PDE-constrained optimization problem in function spaces. To obtain a tractable solution, we reduce the dimension of the concentration field using Proper Orthogonal Decomposition and approximate the unknown source field using nonlinear basis functions, drastically decreasing the number of unknowns. Moreover, to collect the concentration measurements, we control a robot sensor through a sequence of waypoints that maximize the smallest eigenvalue of the Fisher Information matrix of the unknown source parameters. Specifically, after every new measurement, an SI problem is solved to obtain a source estimate that is used to determine the next waypoint. We show that our algorithm can efficiently identify sources in complex AD systems and non-convex domains, in simulation and experimentally. This is the first time that PDEs are used for robotic SI in practice.
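The waypoint criterion, maximizing the smallest eigenvalue of the Fisher Information matrix, can be sketched for a toy case with two source parameters. Everything here is an assumption made for illustration: the additive Gaussian measurement model, the candidate set, the sensitivity vectors, and the function names. The paper's PDE model and reduced-order solver are not reproduced.

```python
import math
from typing import Dict, Tuple

Fim = Tuple[float, float, float]  # (a, b, d) for the symmetric matrix [[a, b], [b, d]]


def min_eig_2x2(a: float, b: float, d: float) -> float:
    """Smallest eigenvalue of the symmetric matrix [[a, b], [b, d]]."""
    return 0.5 * (a + d) - math.hypot(0.5 * (a - d), b)


def add_measurement(fim: Fim, grad: Tuple[float, float], sigma: float = 1.0) -> Fim:
    """Add the rank-one term grad * grad^T / sigma^2 (assumed Gaussian noise)."""
    a, b, d = fim
    g1, g2 = grad
    w = 1.0 / sigma ** 2
    return (a + w * g1 * g1, b + w * g1 * g2, d + w * g2 * g2)


def next_waypoint(fim: Fim, sensitivities: Dict[str, Tuple[float, float]]) -> str:
    """Pick the candidate whose measurement maximizes the smallest eigenvalue."""
    return max(
        sensitivities,
        key=lambda p: min_eig_2x2(*add_measurement(fim, sensitivities[p])),
    )


# Hypothetical state: parameter 1 is already well informed, parameter 2 is not.
fim = (4.0, 0.0, 0.1)
# Hypothetical sensitivities of the measurement to each parameter at two waypoints.
sensitivities = {"A": (1.0, 0.0), "B": (0.0, 1.0)}

best = next_waypoint(fim, sensitivities)
```

In this toy setup the criterion favors the waypoint that informs the poorly known parameter, since the smallest eigenvalue only grows when the weakest direction of the information matrix is improved.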

ROJun 6, 2017
Distributed Active State Estimation with User-Specified Accuracy

Charles Freundlich, Soomin Lee, Michael M. Zavlanos

In this paper, we address the problem of controlling a network of mobile sensors so that a set of hidden states is estimated up to a user-specified accuracy. The sensors take measurements and fuse them online using an Information Consensus Filter (ICF). At the same time, the local estimates guide the sensors to their next best configuration. This leads to an LMI-constrained optimization problem that we solve by means of a new distributed random approximate projections method. The new method is robust to the state disagreement errors that exist among the robots as the ICF fuses the collected measurements. Assuming that the noise corrupting the measurements is zero-mean and Gaussian and that the robots are self-localized in the environment, the integrated system converges to the next best positions from where new observations will be taken. This process is repeated with the robots taking a sequence of observations until the hidden states are estimated up to the desired user-specified accuracy. We present simulations of sparse landmark localization, where the robotic team achieves the desired estimation tolerances while exhibiting interesting emergent behavior.