Vivek S. Borkar

h-index: 50

38 papers

570 citations

Novelty: 43%

AI Score: 38

Ranked #87,562 of 194,257 authors (top 45%) · #19,429 in LG (top 48%)

38 Papers

2.4 · OC · Jul 12, 2014

Greedy Block Coordinate Descent (GBCD) Method for High Dimensional Quadratic Programs

Gugan Thoppe, Vivek S. Borkar, Dinesh Garg · ibm-research

High dimensional unconstrained quadratic programs (UQPs) involving massive datasets are now common in application areas such as the web and social networks. Unless computational resources that match up to these datasets are available, solving such problems using classical UQP methods is very difficult. This paper discusses alternatives. We first define high dimensional compliant (HDC) methods for UQPs, that is, methods that can solve high dimensional UQPs by adapting to available computational resources. We then show that block Kaczmarz and block coordinate descent (BCD) methods are the only existing classes of methods that can be made HDC. As a possible answer to the question of the `best' amongst BCD methods for UQP, we propose a novel greedy BCD (GBCD) method with serial, parallel and distributed variants. Convergence rates and numerical tests confirm that the GBCD is indeed an effective method to solve high dimensional UQPs. In fact, it sometimes beats even the conjugate gradient method.
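The greedy block selection idea can be illustrated with a minimal sketch. It is not the paper's GBCD method: it assumes the objective 0.5 x'Px - q'x with P symmetric positive definite, picks the block with the largest gradient norm, and takes an exact line-search gradient step restricted to that block. The function name `greedy_bcd` and its arguments are hypothetical.

```python
def greedy_bcd(P, q, blocks, iters=500):
    """Greedy block gradient steps for min 0.5 x'Px - q'x (P assumed SPD)."""
    n = len(q)
    x = [0.0] * n
    for _ in range(iters):
        # full gradient g = Px - q
        g = [sum(P[i][j] * x[j] for j in range(n)) - q[i] for i in range(n)]
        # greedy rule (an assumption of this sketch): largest block gradient norm
        B = max(blocks, key=lambda blk: sum(g[i] ** 2 for i in blk))
        gg = sum(g[i] ** 2 for i in B)
        if gg < 1e-24:
            break  # every block gradient is numerically zero
        # exact line search along -g restricted to the chosen block
        gPg = sum(g[i] * P[i][j] * g[j] for i in B for j in B)
        step = gg / gPg
        for i in B:
            x[i] -= step * g[i]
    return x
```

Only the coordinates of the selected block change in each iteration, which is what lets a method of this type adapt its per-step work to the block size. The full gradient is recomputed here purely for brevity.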

1.2 · SY · Oct 17, 2017

Opportunistic Scheduling as Restless Bandits

Vivek S. Borkar, Gaurav S. Kasbekar, Sarath Pattathil et al.

In this paper we consider energy efficient scheduling in a multiuser setting where each user has a finite sized queue and there is a cost associated with holding packets (jobs) in each queue (modeling the delay constraints). The packets of each user need to be sent over a common channel. The channel qualities seen by the users are time-varying and differ across users; also, the cost incurred, i.e., energy consumed, in packet transmission is a function of the channel quality. We pose the problem as an average cost Markov Decision Problem, and prove that this problem is Whittle Indexable. Based on this result, we propose an algorithm in which the Whittle index of each user is computed and the user who has the lowest value is selected for transmission. We evaluate the performance of this algorithm via simulations and show that it achieves a lower average cost than the Maximum Weight Scheduling and Weighted Fair Scheduling strategies.

2.3 · PR · Jan 23, 2013

Asymptotics of the Invariant Measure in Mean Field Models with Jumps

Vivek S. Borkar, Rajesh Sundaresan

We consider the asymptotics of the invariant measure for the process of the empirical spatial distribution of $N$ coupled Markov chains in the limit of a large number of chains. Each chain reflects the stochastic evolution of one particle. The chains are coupled through the dependence of the transition rates on this spatial distribution of particles in the various states. Our model is a caricature for medium access interactions in wireless local area networks. It is also applicable to the study of spread of epidemics in a network. The limiting process satisfies a deterministic ordinary differential equation called the McKean-Vlasov equation. When this differential equation has a unique globally asymptotically stable equilibrium, the spatial distribution asymptotically concentrates on this equilibrium. More generally, its limit points are supported on a subset of the $ω$-limit sets of the McKean-Vlasov equation. Using a control-theoretic approach, we examine the question of large deviations of the invariant measure from this limit.

9.2 · SY · Nov 3, 2022

Reinforcement Learning in Non-Markovian Environments

Siddharth Chandak, Pratik Shah, Vivek S Borkar et al.

Motivated by the novel paradigm developed by Van Roy and coauthors for reinforcement learning in arbitrary non-Markovian environments, we propose a related formulation and explicitly pin down the error caused by non-Markovianity of observations when the Q-learning algorithm is applied on this formulation. Based on this observation, we propose that the criterion for agent design should be to seek good approximations for certain conditional laws. Inspired by classical stochastic control, we show that our problem reduces to that of recursive computation of approximate sufficient statistics. This leads to an autoencoder-based scheme for agent design which is then numerically tested on partially observed reinforcement learning environments.

8.7 · LG · Oct 10, 2022 · Code

Actor-Critic or Critic-Actor? A Tale of Two Time Scales

Shalabh Bhatnagar, Vivek S. Borkar, Soumyajit Guin

We revisit the standard formulation of tabular actor-critic algorithm as a two time-scale stochastic approximation with value function computed on a faster time-scale and policy computed on a slower time-scale. This emulates policy iteration. We observe that reversal of the time scales will in fact emulate value iteration and is a legitimate algorithm. We provide a proof of convergence and compare the two empirically with and without function approximation (with both linear and nonlinear function approximators) and observe that our proposed critic-actor algorithm performs on par with actor-critic in terms of both accuracy and computational effort.

2.0 · OC · Aug 14, 2023

Average cost optimal control under weak ergodicity hypotheses: Relative value iterations

Ari Arapostathis, Vivek S. Borkar

We study Markov decision processes with Polish state and action spaces. The action space is state dependent and is not necessarily compact. We first establish the existence of an optimal ergodic occupation measure using only a near-monotone hypothesis on the running cost. Then we study the well-posedness of the Bellman equation, or what is commonly known as the average cost optimality equation, under the additional hypothesis of the existence of a small set. We deviate from the usual approach, which is based on the vanishing discount method, and instead map the problem to an equivalent one for a controlled split chain. We employ a stochastic representation of the Poisson equation to derive the Bellman equation. Next, under suitable assumptions, we establish convergence results for the 'relative value iteration' algorithm, which computes the solution of the Bellman equation recursively. In addition, we present some results concerning the stability and asymptotic optimality of the associated rolling horizon policies.

5.9 · SY · Apr 7, 2023

Full Gradient Deep Reinforcement Learning for Average-Reward Criterion

Tejas Pagare, Vivek Borkar, Konstantin Avrachenkov

We extend the provably convergent Full Gradient DQN algorithm for discounted reward Markov decision processes from Avrachenkov et al. (2021) to average reward problems. We experimentally compare widely used RVI Q-Learning with recently proposed Differential Q-Learning in the neural function approximation setting with Full Gradient DQN and DQN. We also extend this to learn Whittle indices for Markovian restless multi-armed bandits. We observe a better convergence rate of the proposed Full Gradient variant across different tasks.

1.2 · PF · Feb 9, 2019

Distributed Server Allocation for Content Delivery Networks

Sarath Pattathil, Vivek S. Borkar, Gaurav S. Kasbekar

We propose a dynamic formulation of file-sharing networks in terms of an average cost Markov decision process with constraints. By analyzing a Whittle-like relaxation thereof, we propose an index policy in the spirit of Whittle and compare it by simulations with other natural heuristics.

3.3 · SY · Nov 21, 2023

Decentralised Q-Learning for Multi-Agent Markov Decision Processes with a Satisfiability Criterion

Keshav P. Keval, Vivek S. Borkar

In this paper, we propose a reinforcement learning algorithm to solve a multi-agent Markov decision process (MMDP). The goal, inspired by Blackwell's Approachability Theorem, is to lower the time average cost of each agent to below a pre-specified agent-specific bound. For the MMDP, we assume the state dynamics to be controlled by the joint actions of agents, but the per-stage costs to only depend on the individual agent's actions. We combine the Q-learning algorithm for a weighted combination of the costs of each agent, obtained by a gossip algorithm with the Metropolis-Hastings or Multiplicative Weights formalisms to modulate the averaging matrix of the gossip. We use multiple timescales in our algorithm and prove that under mild conditions, it approximately achieves the desired bounds for each of the agents. We also demonstrate the empirical performance of this algorithm in the more general setting of MMDPs having jointly controlled per-stage costs.

1.2 · SY · Nov 24, 2023

Approximation of Convex Envelope Using Reinforcement Learning

Vivek S. Borkar, Adit Akarsh

Oberman gave a stochastic control formulation of the problem of estimating the convex envelope of a non-convex function. Based on this, we develop a reinforcement learning scheme to approximate the convex envelope, using a variant of Q-learning for controlled optimal stopping. It shows very promising results on a standard library of test problems.

2.1 · ML · Oct 9, 2022

A Concentration Bound for Distributed Stochastic Approximation

Harsh Dolhare, Vivek Borkar

We revisit the classical model of Tsitsiklis, Bertsekas and Athans for distributed stochastic approximation with consensus. The main result is an analysis of this scheme using the ODE approach to stochastic approximation, leading to a high probability bound for the tracking error between suitably interpolated iterates and the limiting differential equation. Several future directions will also be highlighted.

1.8 · LG · Jun 7, 2022

Concentration bounds for SSP Q-learning for average cost MDPs

Shaan Ul Haque, Vivek Borkar

We derive a concentration bound for a Q-learning algorithm for average cost Markov decision processes based on an equivalent shortest path problem, and compare it numerically with the alternative scheme based on relative value iteration.

13.0 · LG · Dec 16, 2023

A Concentration Bound for TD(0) with Function Approximation

Siddharth Chandak, Vivek S. Borkar

We derive a concentration bound of the type `for all $n \geq n_0$ for some $n_0$' for TD(0) with linear function approximation. We work with online TD learning with samples from a single sample path of the underlying Markov chain. This makes our analysis significantly different from offline TD learning or TD learning with access to independent samples from the stationary distribution of the Markov chain. We treat TD(0) as a contractive stochastic approximation algorithm, with both martingale and Markov noises. Markov noise is handled using the Poisson equation and the lack of almost sure guarantees on boundedness of iterates is handled using the concept of relaxed concentration inequalities.
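For orientation, the online setting described above (TD(0) with linear function approximation, driven by one sample path) can be sketched as follows. This is an illustrative sketch only: the function name `td0_linear`, the step-size schedule, and the input layout are assumptions, and nothing here reproduces the paper's concentration analysis.

```python
import random

def td0_linear(P, r, phi, gamma, n_steps, seed=0):
    """Online TD(0) with linear features along a single trajectory.

    P: row-stochastic transition matrix, r: per-state rewards,
    phi: one feature vector per state (all illustrative inputs).
    """
    rng = random.Random(seed)
    theta = [0.0] * len(phi[0])
    s = 0
    for n in range(1, n_steps + 1):
        # next state drawn from the chain itself: no resets, no i.i.d. samples
        s_next = rng.choices(range(len(P)), weights=P[s])[0]
        v = sum(t * f for t, f in zip(theta, phi[s]))
        v_next = sum(t * f for t, f in zip(theta, phi[s_next]))
        delta = r[s] + gamma * v_next - v      # temporal-difference error
        alpha = 100.0 / (100.0 + n)            # assumed diminishing step size
        theta = [t + alpha * delta * f for t, f in zip(theta, phi[s])]
        s = s_next
    return theta
```

With one-hot features the fixed point is the true value function, which gives a simple sanity check on a small chain.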

6.4 · LG · Dec 17, 2024

Lagrangian Index Policy for Restless Bandits with Average Reward

Konstantin Avrachenkov, Vivek S. Borkar, Pratik Shah

We study the Lagrange Index Policy (LIP) for restless multi-armed bandits with long-run average reward. In particular, we compare the performance of LIP with the performance of the Whittle Index Policy (WIP), both heuristic policies known to be asymptotically optimal under certain natural conditions. Even though in most cases their performances are very similar, in the cases when WIP shows bad performance, LIP continues to perform very well. We then propose reinforcement learning algorithms, both tabular and NN-based, to obtain online learning schemes for LIP in the model-free setting. The proposed reinforcement learning schemes for LIP require significantly less memory than the analogous schemes for WIP. We calculate analytically the Lagrange index for the restart model, which applies to the optimal web crawling and the minimization of the weighted age of information. We also give a new proof of asymptotic optimality in case of homogeneous arms as the number of arms goes to infinity, based on exchangeability and de Finetti's theorem.

4.5 · ML · Jun 23, 2025

Asymptotic convexity of wide and shallow neural networks

Vivek Borkar, Parthe Pandit

For a simple model of shallow and wide neural networks, we show that the epigraph of its input-output map as a function of the network parameters approximates the epigraph of a convex function in a precise sense. This leads to a plausible explanation of their observed good performance.

3.3 · PR · Jun 11, 2025

A theoretical basis for model collapse in recursive training

Vivek Shripad Borkar

It is known that recursive training from generative models can lead to the so-called `collapse' of the simulated probability distribution. This note shows that one in fact gets two different asymptotic behaviours depending on whether an external source, howsoever minor, is also contributing samples.

4.1 · LG · Feb 17, 2025

An Actor-Critic Algorithm with Function Approximation for Risk Sensitive Cost Markov Decision Processes

Soumyajit Guin, Vivek S. Borkar, Shalabh Bhatnagar

In this paper, we consider the risk-sensitive cost criterion with exponentiated costs for Markov decision processes and develop a model-free policy gradient algorithm in this setting. Unlike additive cost criteria such as average or discounted cost, the risk-sensitive cost criterion is less studied due to the complexity resulting from the multiplicative structure of the resulting Bellman equation. We develop an actor-critic algorithm with function approximation in this setting and provide its asymptotic convergence analysis. We also show the results of numerical experiments that demonstrate the superiority in performance of our algorithm over other recent algorithms in the literature.

7.3 · AI · Jun 4, 2024

Tabular and Deep Learning for the Whittle Index

Francisco Robledo Relaño, Vivek Borkar, Urtzi Ayesta et al.

The Whittle index policy is a heuristic that has shown remarkably good performance (with guaranteed asymptotic optimality) when applied to the class of problems known as Restless Multi-Armed Bandit Problems (RMABPs). In this paper we present QWI and QWINN, two reinforcement learning algorithms, respectively tabular and deep, to learn the Whittle index for the total discounted criterion. The key feature is the use of two time-scales, a faster one to update the state-action Q-values, and a relatively slower one to update the Whittle indices. In our main theoretical result we show that QWI, which is a tabular implementation, converges to the real Whittle indices. We then present QWINN, an adaptation of the QWI algorithm using neural networks to compute the Q-values on the faster time-scale, which is able to extrapolate information from one state to another and scales naturally to large state-space environments. For QWINN, we show that all local minima of the Bellman error are locally stable equilibria, which is the first result of its kind for DQN-based schemes. Numerical computations show that QWI and QWINN converge faster than the standard Q-learning algorithm, neural-network based approximate Q-learning and other state of the art algorithms.

3.1 · LG · Nov 4, 2021

A Concentration Bound for LSPE($λ$)

Siddharth Chandak, Vivek S. Borkar, Harsh Dolhare

The popular LSPE($λ$) algorithm for policy evaluation is revisited to derive a concentration bound that gives high probability performance guarantees from some time on.

12.2 · ST · Oct 27, 2021

The ODE Method for Asymptotic Statistics in Stochastic Approximation and Reinforcement Learning

Vivek Borkar, Shuhang Chen, Adithya Devraj et al.

The paper concerns the $d$-dimensional stochastic approximation recursion, $$ θ_{n+1}= θ_n + α_{n + 1} f(θ_n, Φ_{n+1}) $$ where $ \{ Φ_n \}$ is a stochastic process on a general state space, satisfying a conditional Markov property that allows for parameter-dependent noise. The main results are established under additional conditions on the mean flow and a version of the Donsker-Varadhan Lyapunov drift condition known as (DV3): (i) An appropriate Lyapunov function is constructed that implies convergence of the estimates in $L_4$. (ii) A functional central limit theorem (CLT) is established, as well as the usual one-dimensional CLT for the normalized error. Moment bounds combined with the CLT imply convergence of the normalized covariance $\textsf{E}[ z_n z_n^T ]$ to the asymptotic covariance in the CLT, where $z_n =: (θ_n-θ^*)/\sqrt{α_n}$. (iii) The CLT holds for the normalized version $z^{\text{PR}}_n =: \sqrt{n} [θ^{\text{PR}}_n -θ^*]$, of the averaged parameters $θ^{\text{PR}}_n =:n^{-1} \sum_{k=1}^nθ_k$, subject to standard assumptions on the step-size. Moreover, the covariance in the CLT coincides with the minimal covariance of Polyak and Ruppert. (iv) An example is given where $f$ and $\bar{f}$ are linear in $θ$, and $Φ$ is a geometrically ergodic Markov chain but does not satisfy (DV3). While the algorithm is convergent, the second moment of $θ_n$ is unbounded and in fact diverges. This arXiv version represents a major extension of the results in prior versions. The main results now allow for parameter-dependent noise, as is often the case in applications to reinforcement learning.
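A toy instance of the recursion and of the Polyak-Ruppert average may help fix notation. The sketch below assumes a scalar parameter, the hypothetical choice f(θ, Φ) = Φ - θ, a two-state Markov chain for Φ with made-up transition probabilities, and step size α_n = n^{-0.7}. It illustrates the recursion only and does not test any of the paper's conditions such as (DV3).

```python
import random

def sa_with_pr_averaging(n_steps, seed=0):
    """Scalar stochastic approximation with Markov noise and PR averaging."""
    rng = random.Random(seed)
    p01, p10 = 0.3, 0.2          # assumed transition probabilities of the noise chain
    phi = 0                      # Markov noise Phi_n on {0, 1}
    theta, running_sum = 0.0, 0.0
    for n in range(1, n_steps + 1):
        # advance the noise chain Phi_n -> Phi_{n+1}
        u = rng.random()
        if phi == 0:
            phi = 1 if u < p01 else 0
        else:
            phi = 0 if u < p10 else 1
        alpha = n ** -0.7                 # step size alpha_n
        theta += alpha * (phi - theta)    # theta_{n+1} = theta_n + alpha * f(theta_n, Phi_{n+1})
        running_sum += theta
    # last iterate and Polyak-Ruppert average n^{-1} * sum_k theta_k
    return theta, running_sum / n_steps
```

For this chain the stationary probability of state 1 is 0.3 / (0.3 + 0.2) = 0.6, so the mean flow is 0.6 - θ and both outputs should settle near θ* = 0.6, with the averaged estimate fluctuating less.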

17.5 · LG · Jun 27, 2021

Concentration of Contractive Stochastic Approximation and Reinforcement Learning

Siddharth Chandak, Vivek S. Borkar, Parth Dodhia

Using a martingale concentration inequality, concentration bounds `from time $n_0$ on' are derived for stochastic approximation algorithms with contractive maps and both martingale difference and Markov noises. These are applied to reinforcement learning algorithms, in particular to asynchronous Q-learning and TD(0).

1.6 · LG · Feb 15, 2021

A Unified Batch Selection Policy for Active Metric Learning

Priyadarshini K, Siddhartha Chaudhuri, Vivek Borkar et al.

Active metric learning is the problem of incrementally selecting high-utility batches of training data (typically, ordered triplets) to annotate, in order to progressively improve a learned model of a metric over some input domain as rapidly as possible. Standard approaches, which independently assess the informativeness of each triplet in a batch, are susceptible to highly correlated batches with many redundant triplets and hence low overall utility. While a recent work \cite{kumari2020batch} proposes batch-decorrelation strategies for metric learning, they rely on ad hoc heuristics to estimate the correlation between two triplets at a time. We present a novel batch active metric learning method that leverages the Maximum Entropy Principle to learn the least biased estimate of triplet distribution for a given set of prior constraints. To avoid redundancy between triplets, our method collectively selects batches with maximum joint entropy, which simultaneously captures both informativeness and diversity. We take advantage of the submodularity of the joint entropy function to construct a tractable solution using an efficient greedy algorithm based on Gram-Schmidt orthogonalization that is provably $\left( 1 - \frac{1}{e} \right)$-optimal. Our approach is the first batch active metric learning method to define a unified score that balances informativeness and diversity for an entire batch of triplets. Experiments with several real-world datasets demonstrate that our algorithm is robust, generalizes well to different applications and input modalities, and consistently outperforms the state-of-the-art.

4.7 · OC · Jul 8, 2020

Dynamic social learning under graph constraints

Konstantin Avrachenkov, Vivek S. Borkar, Sharayu Moharir et al.

We introduce a model of graph-constrained dynamic choice with reinforcement modeled by positively $α$-homogeneous rewards. We show that its empirical process, which can be written as a stochastic approximation recursion with Markov noise, has the same probability law as a certain vertex reinforced random walk. We use this equivalence to show that for $α> 0$, the asymptotic outcome concentrates around the optimum in a certain limiting sense when `annealed' by letting $α\uparrow\infty$ slowly.

19.1 · LG · Apr 29, 2020

Whittle index based Q-learning for restless bandits with average reward

Konstantin E. Avrachenkov, Vivek S. Borkar

A novel reinforcement learning algorithm is introduced for multiarmed restless bandits with average reward, using the paradigms of Q-learning and Whittle index. Specifically, we leverage the structure of the Whittle index policy to reduce the search space of Q-learning, resulting in major computational gains. Rigorous convergence analysis is provided, supported by numerical experiments. The numerical experiments show excellent empirical performance of the proposed scheme.

8.6 · LG · Dec 21, 2019

Online Reinforcement Learning of Optimal Threshold Policies for Markov Decision Processes

Arghyadip Roy, Vivek Borkar, Abhay Karandikar et al.

To overcome the curses of dimensionality and modeling of Dynamic Programming (DP) methods to solve Markov Decision Process (MDP) problems, Reinforcement Learning (RL) methods are adopted in practice. Contrary to traditional RL algorithms which do not consider the structural properties of the optimal policy, we propose a structure-aware learning algorithm to exploit the ordered multi-threshold structure of the optimal policy, if any. We prove the asymptotic convergence of the proposed algorithm to the optimal policy. Due to the reduction in the policy space, the proposed algorithm provides remarkable improvements in storage and computational complexities over classical RL algorithms. Simulation results establish that the proposed algorithm converges faster than other RL algorithms.

3.3 · SI · Oct 19, 2019

Opinion shaping in social networks using reinforcement learning

Vivek Borkar, Alexandre Reiffers-Masson

In this paper, we study how to shape opinions in social networks when the matrix of interactions is unknown. We consider classical opinion dynamics with some stubborn agents and the possibility of continuously influencing the opinions of a few selected agents, albeit under resource constraints. We map the opinion dynamics to a value iteration scheme for policy evaluation for a specific stochastic shortest path problem. This leads to a representation of the opinion vector as an approximate value function for a stochastic shortest path problem with some non-classical constraints. We suggest two possible ways of influencing agents. One leads to a convex optimization problem and the other to a non-convex one. Firstly, for both problems, we propose two different online two-time scale reinforcement learning schemes that converge to the optimal solution of each problem. Secondly, we suggest stochastic gradient descent schemes and compare these classes of algorithms with the two-time scale reinforcement learning schemes. Thirdly, we also derive another algorithm designed to tackle the curse of dimensionality one faces when all agents are observed. Numerical studies are provided to illustrate the convergence and efficiency of our algorithms.

2.9 · LG · Nov 28, 2018

A Structure-aware Online Learning Algorithm for Markov Decision Processes

Arghyadip Roy, Vivek Borkar, Abhay Karandikar et al.

To overcome the curse of dimensionality and curse of modeling in Dynamic Programming (DP) methods for solving classical Markov Decision Process (MDP) problems, Reinforcement Learning (RL) algorithms are popular. In this paper, we consider an infinite-horizon average reward MDP problem and prove the optimality of the threshold policy under certain conditions. Traditional RL techniques do not exploit the threshold nature of optimal policy while learning. In this paper, we propose a new RL algorithm which utilizes the known threshold structure of the optimal policy while learning by reducing the feasible policy space. We establish that the proposed algorithm converges to the optimal policy. It provides a significant improvement in convergence speed and computational and storage complexity over traditional RL algorithms. The proposed technique can be applied to a wide variety of optimization problems that include energy efficient data transmission and management of queues. We exhibit the improvement in convergence speed of the proposed algorithm over other RL algorithms through simulations.

3.2 · RO · Sep 11, 2017

Vector Field Guidance for Convoy Monitoring Using Elliptical Orbits

Aseem V. Borkar, Vivek S. Borkar, Arpita Sinha

We propose a novel vector field based guidance scheme for tracking and surveillance of a convoy, moving along a possibly nonlinear trajectory on the ground, by an aerial agent. The scheme first computes a time varying ellipse that encompasses all the targets in the convoy using a simple regression based algorithm. It then ensures convergence of the agent to a trajectory that repeatedly traverses this moving ellipse. The scheme is analyzed using perturbation theory of nonlinear differential equations and supporting simulations are provided. Some related implementation issues are discussed and advantages of the scheme are highlighted.

1.2 · SY · Aug 28, 2017

Distributed Stochastic Approximation with Local Projections

Suhail M. Shah, Vivek S. Borkar

We propose a distributed version of a stochastic approximation scheme constrained to remain in the intersection of a finite family of convex sets. The projection to the intersection of these sets is also computed in a distributed manner and a `nonlinear gossip' mechanism is employed to blend the projection iterations with the stochastic approximation using multiple time scales.

1.2 · SY · Jul 13, 2017

Whittle Indexability in Egalitarian Processor Sharing Systems

Vivek S. Borkar, Sarath Pattathil

The egalitarian processor sharing model is viewed as a restless bandit and its Whittle indexability is established. A numerical scheme for computing the Whittle indices is provided, along with supporting numerical experiments.

1.9 · LG · May 9, 2016

Randomized Kaczmarz for Rank Aggregation from Pairwise Comparisons

Vivek S. Borkar, Nikhil Karamchandani, Sharad Mirani

We revisit the problem of inferring the overall ranking among entities in the framework of Bradley-Terry-Luce (BTL) model, based on available empirical data on pairwise preferences. By a simple transformation, we can cast the problem as that of solving a noisy linear system, for which a ready algorithm is available in the form of the randomized Kaczmarz method. This scheme is provably convergent, has excellent empirical performance, and is amenable to on-line, distributed and asynchronous variants. Convergence, convergence rate, and error analysis of the proposed algorithm are presented and several numerical experiments are conducted whose results validate our theoretical findings.
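The randomized Kaczmarz method mentioned above can be sketched generically as below. This shows only the standard solver for a linear system Ax = b, with rows sampled in proportion to their squared norms. The transformation from Bradley-Terry-Luce comparison data to a linear system is not reproduced, and the function name and the small consistent test system are illustrative assumptions.

```python
import random

def randomized_kaczmarz(A, b, n_iters, seed=0):
    """Randomized Kaczmarz iterations for Ax = b (rows sampled by squared norm)."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    row_norms = [sum(a * a for a in row) for row in A]
    x = [0.0] * n
    for _ in range(n_iters):
        # pick row i with probability proportional to ||a_i||^2
        i = rng.choices(range(m), weights=row_norms)[0]
        # project the current iterate onto the hyperplane a_i . x = b_i
        residual = b[i] - sum(a * xj for a, xj in zip(A[i], x))
        scale = residual / row_norms[i]
        x = [xj + scale * a for xj, a in zip(x, A[i])]
    return x
```

Each update touches one row of A, which is why the scheme lends itself to the on-line, distributed and asynchronous variants the abstract mentions.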

5.1 · ML · Nov 27, 2015

Gradient Estimation with Simultaneous Perturbation and Compressive Sensing

Vivek S. Borkar, Vikranth R. Dwaracherla, Neeraja Sahasrabudhe

This paper aims at achieving a "good" estimator for the gradient of a function on a high-dimensional space. Often such functions are not sensitive in all coordinates and the gradient of the function is almost sparse. We propose a method for gradient estimation that combines ideas from Spall's Simultaneous Perturbation Stochastic Approximation with compressive sensing. The aim is to obtain a "good" estimator without too many function evaluations. Applications to estimating the gradient outer product matrix as well as to standard optimization problems are illustrated via simulations.

2.1 · LG · Sep 4, 2015

Parallel and Distributed Approaches for Graph Based Semi-supervised Learning

Konstantin Avrachenkov, Vivek Borkar, Krishnakant Saboo

Two approaches for graph based semi-supervised learning are proposed. The first approach is based on iteration of an affine map. A key element of the affine map iteration is sparse matrix-vector multiplication, which has several very efficient parallel implementations. The second approach belongs to the class of Markov Chain Monte Carlo (MCMC) algorithms. It is based on sampling of nodes by performing a random walk on the graph. The latter approach is distributed by its nature and can be easily implemented on several processors or over the network. Both theoretical and practical evaluations are provided. It is found that the nodes are classified into their class with very small error. The sampling algorithm's ability to track new incoming nodes and to classify them is also demonstrated.

17.1 · IR · Mar 30, 2015

Whittle Index Policy for Crawling Ephemeral Content

Konstantin Avrachenkov, Vivek Borkar

We consider the task of scheduling a crawler to retrieve content from several sites with ephemeral content. A user typically loses interest in ephemeral content, like news or posts in social network groups, after several days or hours. Thus, development of a timely crawling policy for such ephemeral information sources is very important. We first formulate this problem as an optimal control problem with average reward. The reward can be measured in the number of clicks or relevant search requests. The problem in its initial formulation suffers from the curse of dimensionality and quickly becomes intractable even with a moderate number of information sources. Fortunately, this problem admits a Whittle index, which leads to problem decomposition and to a very simple and efficient crawling policy. We derive the Whittle index and provide its theoretical justification.

11.8 · OC · Nov 30, 2014

Empirical Q-Value Iteration

Dileep Kalathil, Vivek S. Borkar, Rahul Jain

We propose a new simple and natural algorithm for learning the optimal Q-value function of a discounted-cost Markov Decision Process (MDP) when the transition kernels are unknown. Unlike the classical learning algorithms for MDPs, such as Q-learning and actor-critic algorithms, this algorithm doesn't depend on a stochastic approximation-based method. We show that our algorithm, which we call the empirical Q-value iteration (EQVI) algorithm, converges to the optimal Q-value function. We also give a rate of convergence or a non-asymptotic sample complexity bound, and also show that an asynchronous (or online) version of the algorithm will also work. Preliminary experimental results suggest a faster rate of convergence to a ball park estimate for our algorithm compared to stochastic approximation-based algorithms.
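One way to read the idea of avoiding stochastic approximation is to replace the expectation in Q-value iteration with an empirical average over fresh simulator samples. The sketch below follows that reading and should not be taken as the paper's exact EQVI algorithm: the function name, the `sample_next(s, a, rng)` simulator interface, and the synchronous update over all state-action pairs are assumptions.

```python
import random

def empirical_qvi(cost, sample_next, gamma, n_samples, n_iters, seed=0):
    """Q-value iteration for a discounted-cost MDP with sampled next states.

    cost[s][a]: one-step cost; sample_next(s, a, rng): hypothetical simulator
    returning a next-state index.
    """
    rng = random.Random(seed)
    n_states, n_actions = len(cost), len(cost[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _it in range(n_iters):
        Q_new = [[0.0] * n_actions for _ in range(n_states)]
        for s in range(n_states):
            for a in range(n_actions):
                # empirical average over fresh samples replaces the expectation
                total = 0.0
                for _k in range(n_samples):
                    s_next = sample_next(s, a, rng)
                    total += min(Q[s_next])      # cost minimisation
                Q_new[s][a] = cost[s][a] + gamma * total / n_samples
        Q = Q_new
    return Q
```

No step-size sequence appears anywhere: the only tuning knobs are the number of samples per update and the number of iterations. With a deterministic simulator the scheme reduces to exact Q-value iteration.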

3.7 · LG · Nov 3, 2014

Approachability in Stackelberg Stochastic Games with Vector Costs

Dileep Kalathil, Vivek Borkar, Rahul Jain

The notion of approachability was introduced by Blackwell [1] in the context of vector-valued repeated games. Blackwell's famous approachability theorem prescribes a strategy for approachability, i.e., for `steering' the average cost of a given agent towards a given target set, irrespective of the strategies of the other agents. In this paper, motivated by multi-objective optimization and decision making problems in dynamically changing environments, we address the approachability problem in Stackelberg stochastic games with vector valued cost functions. We make two main contributions. Firstly, we give a simple and computationally tractable strategy for approachability for Stackelberg stochastic games along the lines of Blackwell's. Secondly, we give a reinforcement learning algorithm for learning the approachable strategy when the transition kernel is unknown. We also recover as a by-product Blackwell's necessary and sufficient condition for approachability for convex sets in this setup and thus a complete characterization. We also give sufficient conditions for non-convex sets.

2.9 · LG · Nov 1, 2013

Reinforcement Learning for Matrix Computations: PageRank as an Example

Vivek S. Borkar, Adwaitvedant S. Mathkar

Reinforcement learning has gained wide popularity as a technique for simulation-driven approximate dynamic programming. A less known aspect is that the very reasons that make it effective in dynamic programming can also be leveraged for using it for distributed schemes for certain matrix computations involving non-negative matrices. In this spirit, we propose a reinforcement learning algorithm for PageRank computation that is fashioned after analogous schemes for approximate dynamic programming. The algorithm has the advantage of ease of distributed implementation and more importantly, of being model-free, i.e., not dependent on any specific assumptions about the transition probabilities in the random web-surfer model. We analyze its convergence and finite time behavior and present some supporting numerical experiments.

12.2 · DC · Oct 28, 2013

Distributed Reinforcement Learning via Gossip

Adwaitvedant S. Mathkar, Vivek S. Borkar

We consider the classical TD(0) algorithm implemented on a network of agents wherein the agents also incorporate the updates received from neighboring agents using a gossip-like mechanism. The combined scheme is shown to converge for both discounted and average cost problems.