SYSep 2, 2014
Signaling for Decentralized Routing in a Queueing NetworkYi Ouyang, Demosthenis Teneketzis
A discrete-time decentralized routing problem in a service system consisting of two service stations and two controllers is investigated. Each controller is affiliated with one station. Each station has an infinite size buffer. Exogenous customer arrivals at each station occur with rate $λ$. Service times at each station have rate $μ$. At any time, a controller can route one of the customers waiting in its own station to the other station. Each controller knows perfectly the queue length in its own station and observes the exogenous arrivals to its own station as well as the arrivals of customers sent from the other station. At the beginning, each controller has a probability mass function (PMF) on the number of customers in the other station. These PMFs are common knowledge between the two controllers. At each time a holding cost is incurred at each station due to the customers waiting at that station. The objective is to determine routing policies for the two controllers that minimize either the total expected holding cost over a finite horizon or the average cost per unit time over an infinite horizon. In this problem there is implicit communication between the two controllers; whenever a controller decides to send or not to send a customer from its own station to the other station it communicates information about its queue length to the other station. This implicit communication through control actions is referred to as signaling in decentralized control. Signaling results in complex communication and decision problems. In spite of the complexity of signaling involved, it is shown that an optimal signaling strategy is described by a threshold policy which depends on the common information between the two controllers; this threshold policy is explicitly determined.
SYJun 17, 2018
Optimal Local and Remote Controllers with Unreliable Uplink ChannelsSeyed Mohammad Asghari, Yi Ouyang, Ashutosh Nayyar
We consider a networked control system consisting of a remote controller and a collection of linear plants, each associated with a local controller. Each local controller directly observes the state of its co-located plant and can inform the remote controller of the plant's state through an unreliable uplink channel. We assume that the downlink channels from the remote controller to local controllers are perfect. The objective of the local controllers and the remote controller is to cooperatively minimize a quadratic performance cost. We provide a dynamic program for this decentralized control problem using the common information approach. Although our problem is not a partially nested problem, we obtain explicit optimal strategies for all controllers. In the optimal strategies, all controllers compute common estimates of the states of the plants based on the common information obtained from the communication network. The remote controller's action is linear in the common state estimates, and the action of each local controller is linear in both the actual state of its co-located plant and the common state estimates. We illustrate our results with numerical experiments using randomly generated models.
SYJun 18, 2018
Optimal Infinite Horizon Decentralized Networked Controllers with Unreliable CommunicationYi Ouyang, Seyed Mohammad Asghari, Ashutosh Nayyar
We consider a decentralized networked control system (DNCS) consisting of a remote controller and a collection of linear plants, each associated with a local controller. Each local controller directly observes the state of its co-located plant and can inform the remote controller of the plant's state through an unreliable uplink channel. The downlink channels from the remote controller to local controllers were assumed to be perfect. The objective of the local controllers and the remote controller is to cooperatively minimize the infinite horizon time average of expected quadratic cost. The finite horizon version of this problem was solved in our prior work [2]. The optimal strategies in the finite horizon case were shown to be characterized by coupled Riccati recursions. In this paper, we show that if the link failure probabilities are below certain critical thresholds, then the coupled Riccati recursions of the finite horizon solution reach a steady state and the corresponding decentralized strategies are optimal. Above these thresholds, we show that no strategy can achieve finite cost. We exploit a connection between our DNCS Riccati recursions and the coupled Riccati recursions of an auxiliary Markov jump linear system to obtain our results. Our main results in Theorems 1 and 2 explicitly identify the critical thresholds for the link failure probabilities and the optimal decentralized control strategies when all link failure probabilities are below their thresholds.
SYJun 23, 2016
Optimal Local and Remote Controllers with Unreliable CommunicationYi Ouyang, Seyed Mohammad Asghari, Ashutosh Nayyar
We consider a decentralized optimal control problem for a linear plant controlled by two controllers, a local controller and a remote controller. The local controller directly observes the state of the plant and can inform the remote controller of the plant state through a packet-drop channel. We assume that the remote controller is able to send acknowledgments to the local controller to signal the successful receipt of transmitted packets. The objective of the two controllers is to cooperatively minimize a quadratic performance cost. We provide a dynamic program for this decentralized control problem using the common information approach. Although our problem is not a partially nested LQG problem, we obtain explicit optimal strategies for the two controllers. In the optimal strategies, both controllers compute a common estimate of the plant state based on the common information. The remote controller's action is linear in the common estimated state, and the local controller's action is linear in both the actual state and the common estimated state.
SYFeb 9, 2019
Worst-case Guarantees for Remote Estimation of an Uncertain SourceMukul Gagrani, Yi Ouyang, Mohammad Rasouli et al.
Consider a remote estimation problem where a sensor wants to communicate the state of an uncertain source to a remote estimator over a finite time horizon. The uncertain source is modeled as an autoregressive process with bounded noise. Given that the sensor has a limited communication budget, the sensor must decide when to transmit the state to the estimator who has to produce real-time estimates of the source state. In this paper, we consider the problem of finding a scheduling strategy for the sensor and an estimation strategy for the estimator to jointly minimize the worst-case maximum instantaneous estimation error over the time horizon. This leads to a decentralized minimax decision-making problem. We obtain a complete characterization of optimal strategies for this decentralized minimax problem. In particular, we show that an open loop communication scheduling strategy is optimal and the optimal estimate depends only on the most recently received sensor observation.
SYFeb 2, 2018
Decentralized Control of Stochastically Switched Linear System with Unreliable CommunicationSeyed Mohammad Asghari, Yi Ouyang, Ashutosh Nayyar
We consider a networked control system (NCS) consisting of two plants, a global plant and a local plant, and two controllers, a global controller and a local controller. The global (resp. local) plant follows discrete-time stochastically switched linear dynamics with a continuous global (resp. local) state and a discrete global (resp. local) mode. We assume that the state and mode of the global plant are observed by both controllers while the state and mode of the local plant are only observed by the local controller. The local controller can inform the global controller of the local plant's state and mode through an unreliable TCP-like communication channel where successful transmissions are acknowledged. The objective of the controllers is to cooperatively minimize a modes-dependent quadratic cost over a finite time horizon. Following the method developed in [1] and [2], we construct a dynamic program based on common information and a decomposition of strategies, and use it to obtain explicit optimal strategies for the controllers. In the optimal strategies, both controllers compute a common estimate of the local plant's state. The global controller's action is linear in the state of the global plant and the common estimated state, and the local controller's action is linear in the actual states of both plants and the common estimated state. Furthermore, the gain matrices for the global controller depend on the global mode and its observation about the local mode, while the gain matrices for the local controller depend on the actual modes of both plants and the global controller's observation about the local mode.
LGAug 15, 2022
Grasping Core Rules of Time Series through Pure ModelsGedi Liu, Yifeng Jiang, Yi Ouyang et al.
Time series underwent the transition from statistics to deep learning, as did many other machine learning fields. Although it appears that the accuracy has been increasing as the model is updated in a number of publicly available datasets, it typically only increases the scale by several times in exchange for a slight difference in accuracy. Through this experiment, we point out a different line of thinking, time series, especially long-term forecasting, may differ from other fields. It is not necessary to use extensive and complex models to grasp all aspects of time series, but to use pure models to grasp the core rules of time series changes. With this simple but effective idea, we created PureTS, a network with three pure linear layers that achieved state-of-the-art in 80% of the long sequence prediction tasks while being nearly the lightest model and having the fastest running speed. On this basis, we discuss the potential of pure linear layers in both phenomena and essence. The ability to understand the core law contributes to the high precision of long-distance prediction, and reasonable fluctuation prevents it from distorting the curve in multi-step prediction like mainstream deep learning models, which is summarized as a pure linear neural network that avoids over-fluctuating. Finally, we suggest the fundamental design standards for lightweight long-step time series tasks: input and output should try to have the same dimension, and the structure avoids fragmentation and complex operations.
LGFeb 20, 2020Code
Enhanced Adversarial Strategically-Timed Attacks against Deep Reinforcement LearningChao-Han Huck Yang, Jun Qi, Pin-Yu Chen et al.
Recent deep neural networks based techniques, especially those equipped with the ability of self-adaptation in the system level such as deep reinforcement learning (DRL), are shown to possess many advantages of optimizing robot learning systems (e.g., autonomous navigation and continuous robot arm control.) However, the learning-based systems and the associated models may be threatened by the risks of intentionally adaptive (e.g., noisy sensor confusion) and adversarial perturbations from real-world scenarios. In this paper, we introduce timing-based adversarial strategies against a DRL-based navigation system by jamming in physical noise patterns on the selected time frames. To study the vulnerability of learning-based navigation systems, we propose two adversarial agent models: one refers to online learning; another one is based on evolutionary learning. Besides, three open-source robot learning and navigation control environments are employed to study the vulnerability under adversarial timing attacks. Our experimental results show that the adversarial timing attacks can lead to a significant performance drop, and also suggest the necessity of enhancing the robustness of robot learning systems.
LGDec 24, 2024
Semi-supervised Credit Card Fraud Detection via Attribute-Driven Graph RepresentationSheng Xiang, Mingzhi Zhu, Dawei Cheng et al.
Credit card fraud incurs a considerable cost for both cardholders and issuing banks. Contemporary methods apply machine learning-based classifiers to detect fraudulent behavior from labeled transaction records. But labeled data are usually a small proportion of billions of real transactions due to expensive labeling costs, which implies that they do not well exploit many natural features from unlabeled data. Therefore, we propose a semi-supervised graph neural network for fraud detection. Specifically, we leverage transaction records to construct a temporal transaction graph, which is composed of temporal transactions (nodes) and interactions (edges) among them. Then we pass messages among the nodes through a Gated Temporal Attention Network (GTAN) to learn the transaction representation. We further model the fraud patterns through risk propagation among transactions. The extensive experiments are conducted on a real-world transaction dataset and two publicly available fraud detection datasets. The result shows that our proposed method, namely GTAN, outperforms other state-of-the-art baselines on three fraud detection datasets. Semi-supervised experiments demonstrate the excellent fraud detection performance of our model with only a tiny proportion of labeled data.
CLDec 19, 2023
COOPER: Coordinating Specialized Agents towards a Complex Dialogue GoalYi Cheng, Wenge Liu, Jian Wang et al.
In recent years, there has been a growing interest in exploring dialogues with more complex goals, such as negotiation, persuasion, and emotional support, which go beyond traditional service-focused dialogue systems. Apart from the requirement for much more sophisticated strategic reasoning and communication skills, a significant challenge of these tasks lies in the difficulty of objectively measuring the achievement of their goals in a quantifiable way, making it difficult for existing research to directly optimize the dialogue procedure towards them. In our work, we emphasize the multifaceted nature of complex dialogue goals and argue that it is more feasible to accomplish them by comprehensively considering and jointly promoting their different aspects. To this end, we propose a novel dialogue framework, Cooper, which coordinates multiple specialized agents, each dedicated to a specific dialogue goal aspect separately, to approach the complex objective. Through this divide-and-conquer manner, we make complex dialogue goals more approachable and elicit greater intelligence via the collaboration of individual agents. Experiments on persuasion and emotional support dialogues demonstrate the superiority of our method over a set of competitive baselines.
LGMar 15, 2024
Generation is better than Modification: Combating High Class Homophily Variance in Graph Anomaly DetectionRui Zhang, Dawei Cheng, Xin Liu et al.
Graph-based anomaly detection is currently an important research topic in the field of graph neural networks (GNNs). We find that in graph anomaly detection, the homophily distribution differences between different classes are significantly greater than those in homophilic and heterophilic graphs. For the first time, we introduce a new metric called Class Homophily Variance, which quantitatively describes this phenomenon. To mitigate its impact, we propose a novel GNN model named Homophily Edge Generation Graph Neural Network (HedGe). Previous works typically focused on pruning, selecting or connecting on original relationships, and we refer to these methods as modifications. Different from these works, our method emphasizes generating new relationships with low class homophily variance, using the original relationships as an auxiliary. HedGe samples homophily adjacency matrices from scratch using a self-attention mechanism, and leverages nodes that are relevant in the feature space but not directly connected in the original graph. Additionally, we modify the loss function to punish the generation of unnecessary heterophilic edges by the model. Extensive comparison experiments demonstrate that HedGe achieved the best performance across multiple benchmark datasets, including anomaly detection and edgeless node classification. The proposed model also improves the robustness under the novel Heterophily Attack with increased class homophily variance on other graph classification tasks.
OCFeb 13, 2024
Model approximation in MDPs with unbounded per-step costBerk Bozkurt, Aditya Mahajan, Ashutosh Nayyar et al.
We consider the problem of designing a control policy for an infinite-horizon discounted cost Markov decision process $\mathcal{M}$ when we only have access to an approximate model $\hat{\mathcal{M}}$. How well does an optimal policy $\hatπ^{\star}$ of the approximate model perform when used in the original model $\mathcal{M}$? We answer this question by bounding a weighted norm of the difference between the value function of $\hatπ^\star $ when used in $\mathcal{M}$ and the optimal value function of $\mathcal{M}$. We then extend our results and obtain potentially tighter upper bounds by considering affine transformations of the per-step cost. We further provide upper bounds that explicitly depend on the weighted distance between cost functions and weighted distance between transition kernels of the original and approximate models. We present examples to illustrate our results.
0.4ITApr 9
On quadratic binomial vectorial functions with maximal bent componentsXianhong Xie, Yi Ouyang, Shenxing Zhang
Assume $n=2m\geq 2$ and let $F(x)=x^{d_1}+x^{d_2}$ be a binomial vectorial function over $\F_{2^n}$ possessing the maximal number (i.e. $2^n-2^m$) of bent components. Suppose the $2$-adic Hamming weights $\wt_2(d_1)$ and $\wt_2(d_2)$ are both at most $2$, we prove that $F(x)$ is affine equivalent to either $x^{2^m+1}$ or $x^{2^i}(x+x^{2^m})$, provided that \[ \ell(n):=\min_{γ:~\F_2(γ)=\F_{2^n}} \dim_{\F_2}\F_2[Ï]γ>m, \] where $Ï$ is the Frobenius $(x\mapsto x^2)$ on $\F_{2^n}$, and $\gcd(d_1,d_2,2^m-1)>1$. Under this condition, we also establish two bounds on the nonlinearity and the differential uniformity of $F$ by means of the cardinality of its image set.
CVAug 27, 2025
Divide, Weight, and Route: Difficulty-Aware Optimization with Dynamic Expert Fusion for Long-tailed RecognitionXiaolei Wei, Yi Ouyang, Haibo Ye
Long-tailed visual recognition is challenging not only due to class imbalance but also because of varying classification difficulty across categories. Simply reweighting classes by frequency often overlooks those that are intrinsically hard to learn. To address this, we propose \textbf{DQRoute}, a modular framework that combines difficulty-aware optimization with dynamic expert collaboration. DQRoute first estimates class-wise difficulty based on prediction uncertainty and historical performance, and uses this signal to guide training with adaptive loss weighting. On the architectural side, DQRoute employs a mixture-of-experts design, where each expert specializes in a different region of the class distribution. At inference time, expert predictions are weighted by confidence scores derived from expert-specific OOD detectors, enabling input-adaptive routing without the need for a centralized router. All components are trained jointly in an end-to-end manner. Experiments on standard long-tailed benchmarks demonstrate that DQRoute significantly improves performance, particularly on rare and difficult classes, highlighting the benefit of integrating difficulty modeling with decentralized expert routing.
CVMay 20, 2025
Beyond Words: Multimodal LLM Knows When to SpeakZikai Liao, Yi Ouyang, Yi-Lun Lee et al.
While large language model (LLM)-based chatbots have demonstrated strong capabilities in generating coherent and contextually relevant responses, they often struggle with understanding when to speak, particularly in delivering brief, timely reactions during ongoing conversations. This limitation arises largely from their reliance on text input, lacking the rich contextual cues in real-world human dialogue. In this work, we focus on real-time prediction of response types, with an emphasis on short, reactive utterances that depend on subtle, multimodal signals across vision, audio, and text. To support this, we introduce a new multimodal dataset constructed from real-world conversational videos, containing temporally aligned visual, auditory, and textual streams. This dataset enables fine-grained modeling of response timing in dyadic interactions. Building on this dataset, we propose MM-When2Speak, a multimodal LLM-based model that adaptively integrates visual, auditory, and textual context to predict when a response should occur, and what type of response is appropriate. Experiments show that MM-When2Speak significantly outperforms state-of-the-art unimodal and LLM-based baselines, achieving up to a 4x improvement in response timing accuracy over leading commercial LLMs. These results underscore the importance of multimodal inputs for producing timely, natural, and engaging conversational AI.
CLJun 20, 2024
AutoPal: Autonomous Adaptation to Users for Personal AI CompanionshipYi Cheng, Wenge Liu, Kaishuai Xu et al.
Previous research has demonstrated the potential of AI agents to act as companions that can provide constant emotional support for humans. In this paper, we emphasize the necessity of autonomous adaptation in personal AI companionship, an underexplored yet promising direction. Such adaptability is crucial as it can facilitate more tailored interactions with users and allow the agent to evolve in response to users' changing needs. However, imbuing agents with autonomous adaptability presents unique challenges, including identifying optimal adaptations to meet users' expectations and ensuring a smooth transition during the adaptation process. To address them, we devise a hierarchical framework, AutoPal, that enables controllable and authentic adjustments to the agent's persona based on user interactions. A personamatching dataset is constructed to facilitate the learning of optimal persona adaptations. Extensive experiments demonstrate the effectiveness of AutoPal and highlight the importance of autonomous adaptability in AI companionship.
CLJun 16, 2024
COOL: Comprehensive Knowledge Enhanced Prompt Learning for Domain Adaptive Few-shot Fake News DetectionYi Ouyang, Peng Wu, Li Pan
Most Fake News Detection (FND) methods often struggle with data scarcity for emerging news domain. Recently, prompt learning based on Pre-trained Language Models (PLM) has emerged as a promising approach in domain adaptive few-shot learning, since it greatly reduces the need for labeled data by bridging the gap between pre-training and downstream task. Furthermore, external knowledge is also helpful in verifying emerging news, as emerging news often involves timely knowledge that may not be contained in the PLM's outdated prior knowledge. To this end, we propose COOL, a Comprehensive knOwledge enhanced prOmpt Learning method for domain adaptive few-shot FND. Specifically, we propose a comprehensive knowledge extraction module to extract both structured and unstructured knowledge that are positively or negatively correlated with news from external sources, and adopt an adversarial contrastive enhanced hybrid prompt learning strategy to model the domain-invariant news-knowledge interaction pattern for FND. Experimental results demonstrate the superiority of COOL over various state-of-the-arts.
SYAug 19, 2021
A relaxed technical assumption for posterior sampling-based reinforcement learning for control of unknown linear systemsMukul Gagrani, Sagar Sudhakara, Aditya Mahajan et al.
We revisit the Thompson sampling algorithm to control an unknown linear quadratic (LQ) system recently proposed by Ouyang et al (arXiv:1709.04047). The regret bound of the algorithm was derived under a technical assumption on the induced norm of the closed loop system. In this technical note, we show that by making a minor modification in the algorithm (in particular, ensuring that an episode does not end too soon), this technical assumption on the induced norm can be replaced by a milder assumption in terms of the spectral radius of the closed loop system. The modified algorithm has the same Bayesian regret of $\tilde{\mathcal{O}}(\sqrt{T})$, where $T$ is the time-horizon and the $\tilde{\mathcal{O}}(\cdot)$ notation hides logarithmic terms in~$T$.
SYAug 18, 2021
Scalable regret for learning to control network-coupled subsystems with unknown dynamicsSagar Sudhakara, Aditya Mahajan, Ashutosh Nayyar et al.
We consider the problem of controlling an unknown linear quadratic Gaussian (LQG) system consisting of multiple subsystems connected over a network. Our goal is to minimize and quantify the regret (i.e. loss in performance) of our strategy with respect to an oracle who knows the system model. Viewing the interconnected subsystems globally and directly using existing LQG learning algorithms for the global system results in a regret that increases super-linearly with the number of subsystems. Instead, we propose a new Thompson sampling based learning algorithm which exploits the structure of the underlying network. We show that the expected regret of the proposed algorithm is bounded by $\tilde{\mathcal{O}} \big( n \sqrt{T} \big)$ where $n$ is the number of subsystems, $T$ is the time horizon and the $\tilde{\mathcal{O}}(\cdot)$ notation hides logarithmic terms in $n$ and $T$. Thus, the regret scales linearly with the number of subsystems. We present numerical experiments to illustrate the salient features of the proposed algorithm.
LGFeb 18, 2021
Training a Resilient Q-Network against Observational InterferenceChao-Han Huck Yang, I-Te Danny Hung, Yi Ouyang et al.
Deep reinforcement learning (DRL) has demonstrated impressive performance in various gaming simulators and real-world applications. In practice, however, a DRL agent may receive faulty observation by abrupt interferences such as black-out, frozen-screen, and adversarial perturbation. How to design a resilient DRL algorithm against these rare but mission-critical and safety-crucial scenarios is an essential yet challenging task. In this paper, we consider a deep q-network (DQN) framework training with an auxiliary task of observational interferences such as artificial noises. Inspired by causal inference for observational interference, we propose a causal inference based DQN algorithm called causal inference Q-network (CIQ). We evaluate the performance of CIQ in several benchmark DQN environments with different types of interferences as auxiliary labels. Our experimental results show that the proposed CIQ method could achieve higher performance and more resilience against observational interferences.
LGNov 24, 2020
Discovering Avoidable Planner Failures of Autonomous Vehicles using Counterfactual Analysis in Behaviorally Diverse SimulationDaisuke Nishiyama, Mario Ynocente Castro, Shirou Maruyama et al.
Automated Vehicles require exhaustive testing in simulation to detect as many safety-critical failures as possible before deployment on public roads. In this work, we focus on the core decision-making component of autonomous robots: their planning algorithm. We introduce a planner testing framework that leverages recent progress in simulating behaviorally diverse traffic participants. Using large scale search, we generate, detect, and characterize dynamic scenarios leading to collisions. In particular, we propose methods to distinguish between unavoidable and avoidable accidents, focusing especially on automatically finding planner-specific defects that must be corrected before deployment. Through experiments in complex multi-agent intersection scenarios, we show that our method can indeed find a wide range of critical planner failures.
SYNov 9, 2020
Thompson sampling for linear quadratic mean-field teamsMukul Gagrani, Sagar Sudhakara, Aditya Mahajan et al.
We consider optimal control of an unknown multi-agent linear quadratic (LQ) system where the dynamics and the cost are coupled across the agents through the mean-field (i.e., empirical mean) of the states and controls. Directly using single-agent LQ learning algorithms in such models results in regret which increases polynomially with the number of agents. We propose a new Thompson sampling based learning algorithm which exploits the structure of the system model and show that the expected Bayesian regret of our proposed algorithm for a system with agents of $|M|$ different types at time horizon $T$ is $\tilde{\mathcal{O}} \big( |M|^{1.5} \sqrt{T} \big)$ irrespective of the total number of agents, where the $\tilde{\mathcal{O}}$ notation hides logarithmic factors in $T$. We present detailed numerical experiments to illustrate the salient features of the proposed algorithm.
LGJan 27, 2020
Regret Bounds for Decentralized Learning in Cooperative Multi-Agent Dynamical SystemsSeyed Mohammad Asghari, Yi Ouyang, Ashutosh Nayyar
Regret analysis is challenging in Multi-Agent Reinforcement Learning (MARL) primarily due to the dynamical environments and the decentralized information among agents. We attempt to solve this challenge in the context of decentralized learning in multi-agent linear-quadratic (LQ) dynamical systems. We begin with a simple setup consisting of two agents and two dynamically decoupled stochastic linear systems, each system controlled by an agent. The systems are coupled through a quadratic cost function. When both systems' dynamics are unknown and there is no communication among the agents, we show that no learning policy can generate sub-linear in $T$ regret, where $T$ is the time horizon. When only one system's dynamics are unknown and there is one-directional communication from the agent controlling the unknown system to the other agent, we propose a MARL algorithm based on the construction of an auxiliary single-agent LQ problem. The auxiliary single-agent problem in the proposed MARL algorithm serves as an implicit coordination mechanism among the two learning agents. This allows the agents to achieve a regret within $O(\sqrt{T})$ of the regret of the auxiliary single-agent problem. Consequently, using existing results for single-agent LQ regret, our algorithm provides a $\tilde{O}(\sqrt{T})$ regret bound. (Here $\tilde{O}(\cdot)$ hides constants and logarithmic factors). Our numerical experiments indicate that this bound is matched in practice. From the two-agent problem, we extend our results to multi-agent LQ systems with certain communication patterns.
LGDec 9, 2019
Learning Latent State Spaces for Planning through Reward PredictionAaron Havens, Yi Ouyang, Prabhat Nagarajan et al.
Model-based reinforcement learning methods typically learn models for high-dimensional state spaces by aiming to reconstruct and predict the original observations. However, drawing inspiration from model-free reinforcement learning, we propose learning a latent dynamics model directly from rewards. In this work, we introduce a model-based planning framework which learns a latent reward prediction model and then plans in the latent state-space. The latent representation is learned exclusively from multi-step reward prediction which we show to be the only necessary information for successful planning. With this framework, we are able to benefit from the concise model-free representation, while still enjoying the data-efficiency of model-based algorithms. We demonstrate our framework in multi-pendulum and multi-cheetah environments where several pendulums or cheetahs are shown to the agent but only one of which produces rewards. In these environments, it is important for the agent to construct a concise latent representation to filter out irrelevant observations. We find that our method can successfully learn an accurate latent reward prediction model in the presence of the irrelevant information while existing model-based methods fail. Planning in the learned latent state-space shows strong performance and high sample efficiency over model-free and model-based baselines.
ROOct 17, 2019
Online Learning in Planar Pushing with Combined Prediction ModelHuidong Gao, Yi Ouyang, Masayoshi Tomizuka
Pushing is a useful robotic capability for positioning and reorienting objects. The ability to accurately predict the effect of pushes can enable efficient trajectory planning and complicated object manipulation. Physical prediction models for planar pushing have long been established, but their assumptions and requirements usually don't hold in most practical settings. Data-driven approaches can provide accurate predictions for offline data, but they often have generalizability issues. In this paper, we propose a combined prediction model and an online learning framework for planar push prediction. The combined model consists of a neural network module and analytical components with a low-dimensional parameter. We train the neural network offline using pre-collected pushing data. In online situations, the low-dimensional analytical parameter is learned directly from online pushes to quickly adapt to the new environments. We test our combined model and learning framework on real pushing experiments. Our experimental results show that our model is able to quickly adapt to new environments while achieving similar final prediction performance as that of pure neural network models.
LGMay 24, 2019
Learning Cross-Domain Representation with Multi-Graph Neural NetworkYi Ouyang, Bin Guo, Xing Tang et al.
Learning effective embedding has been proved to be useful in many real-world problems, such as recommender systems, search ranking and online advertisement. However, one of the challenges is data sparsity in learning large-scale item embedding, as users' historical behavior data are usually lacking or insufficient in an individual domain. In fact, user's behaviors from different domains regarding the same items are usually relevant. Therefore, we can learn complete user behaviors to alleviate the sparsity using complementary information from correlated domains. It is intuitive to model users' behaviors using graph, and graph neural networks (GNNs) have recently shown the great power for representation learning, which can be used to learn item embedding. However, it is challenging to transfer the information across domains and learn cross-domain representation using the existing GNNs. To address these challenges, in this paper, we propose a novel model - Deep Multi-Graph Embedding (DMGE) to learn cross-domain representation. Specifically, we first construct a multi-graph based on users' behaviors from different domains, and then propose a multi-graph neural network to learn cross-domain representation in an unsupervised manner. Particularly, we present a multiple-gradient descent optimizer for efficiently training the model. We evaluate our approach on various large-scale real-world datasets, and the experimental results show that DMGE outperforms other state-of-art embedding methods in various tasks.
HCFeb 15, 2018
CompetitiveBike: Competitive Prediction of Bike-Sharing Apps Using Heterogeneous Crowdsourced DataYi Ouyang, Bin Guo, Xinjiang Lu et al.
In recent years, bike-sharing systems have been deployed in many cities, which provide an economical lifestyle. With the prevalence of bike-sharing systems, a lot of companies join the market, leading to increasingly fierce competition. To be competitive, bike-sharing companies and app developers need to make strategic decisions for mobile apps development. Therefore, it is significant to predict and compare the popularity of different bike-sharing apps. However, existing works mostly focus on predicting the popularity of a single app, the popularity contest among different apps has not been explored yet. In this paper, we aim to forecast the popularity contest between Mobike and Ofo, two most popular bike-sharing apps in China. We develop CompetitiveBike, a system to predict the popularity contest among bike-sharing apps. Moreover, we conduct experiments on real-world datasets collected from 11 app stores and Sina Weibo, and the experiments demonstrate the effectiveness of our approach.
LGSep 14, 2017
Learning Unknown Markov Decision Processes: A Thompson Sampling ApproachYi Ouyang, Mukul Gagrani, Ashutosh Nayyar et al.
We consider the problem of learning an unknown Markov Decision Process (MDP) that is weakly communicating in the infinite horizon setting. We propose a Thompson Sampling-based reinforcement learning algorithm with dynamic episodes (TSDE). At the beginning of each episode, the algorithm generates a sample from the posterior distribution over the unknown model parameters. It then follows the optimal stationary policy for the sampled model for the rest of the episode. The duration of each episode is dynamically determined by two stopping criteria. The first stopping criterion controls the growth rate of episode length. The second stopping criterion happens when the number of visits to any state-action pair is doubled. We establish $\tilde O(HS\sqrt{AT})$ bounds on expected regret under a Bayesian setting, where $S$ and $A$ are the sizes of the state and action spaces, $T$ is time, and $H$ is the bound of the span. This regret bound matches the best available bound for weakly communicating MDPs. Numerical results show it to perform better than existing algorithms for infinite horizon MDPs.
SYSep 12, 2017
Learning-based Control of Unknown Linear Systems with Thompson SamplingYi Ouyang, Mukul Gagrani, Rahul Jain
We propose a Thompson sampling-based learning algorithm for the Linear Quadratic (LQ) control problem with unknown system parameters. The algorithm is called Thompson sampling with dynamic episodes (TSDE) where two stopping criteria determine the lengths of the dynamic episodes in Thompson sampling. The first stopping criterion controls the growth rate of episode length. The second stopping criterion is triggered when the determinant of the sample covariance matrix is less than half of the previous value. We show under some conditions on the prior distribution that the expected (Bayesian) regret of TSDE accumulated up to time T is bounded by O(\sqrt{T}). Here O(.) hides constants and logarithmic factors. This is the first O(\sqrt{T} ) bound on expected regret of learning in LQ control. By introducing a reinitialization schedule, we also show that the algorithm is robust to time-varying drift in model parameters. Numerical simulations are provided to illustrate the performance of TSDE.
GTOct 23, 2015
Dynamic Games with Asymmetric Information: Common Information Based Perfect Bayesian Equilibria and Sequential DecompositionYi Ouyang, Hamidreza Tavafoghi, Demosthenis Teneketzis
We formulate and analyze a general class of stochastic dynamic games with asymmetric information arising in dynamic systems. In such games, multiple strategic agents control the system dynamics and have different information about the system over time. Because of the presence of asymmetric information, each agent needs to form beliefs about other agents' private information. Therefore, the specification of the agents' beliefs along with their strategies is necessary to study the dynamic game. We use Perfect Bayesian equilibrium (PBE) as our solution concept. A PBE consists of a pair of strategy profile and belief system. In a PBE, every agent's strategy should be a best response under the belief system, and the belief system depends on agents' strategy profile when there is signaling among agents. Therefore, the circular dependence between strategy profile and belief system makes it difficult to compute PBE. Using the common information among agents, we introduce a subclass of PBE called common information based perfect Bayesian equilibria (CIB-PBE), and provide a sequential decomposition of the dynamic game. Such decomposition leads to a backward induction algorithm to compute CIB-PBE. We illustrate the sequential decomposition with an example of a multiple access broadcast game. We prove the existence of CIB-PBE for a subclass of dynamic games.
NIJun 28, 2015
A Common Information-Based Multiple Access Protocol Achieving Full Throughput and Linear DelayYi Ouyang, Demosthenis Teneketzis
We consider a multiple access communication system where multiple users share a common collision channel. Each user observes its local traffic and the feedback from the channel. At each time instant the feedback from the channel is one of three messages: no transmission, successful transmission, collision. The objective is to design a transmission protocol that coordinates the users' transmissions and achieves high throughput and low delay. We present a decentralized Common Information-Based Multiple Access (CIMA) protocol that has the following features: (i) it achieves the full throughput region of the collision channel; (ii) it results in a delay that is linear in the number of users, and is significantly lower than that of CSMA protocols; (iii) it avoids collisions without channel sensing.