Steve Marcus

LG
3papers
100citations
Novelty53%
AI Score25

3 Papers

LGNov 30, 2016
Bandit algorithms to emulate human decision making using probabilistic distortions

Ravi Kumar Kolla, Prashanth L. A., Aditya Gopalan et al.

Motivated by models of human decision making proposed to explain commonly observed deviations from conventional expected value preferences, we formulate two stochastic multi-armed bandit problems with distorted probabilities on the reward distributions: the classic $K$-armed bandit and the linearly parameterized bandit settings. We consider the aforementioned problems in the regret minimization as well as best arm identification framework for multi-armed bandits. For the regret minimization setting in $K$-armed as well as linear bandit problems, we propose algorithms that are inspired by Upper Confidence Bound (UCB) algorithms, incorporate reward distortions, and exhibit sublinear regret. For the $K$-armed bandit setting, we derive an upper bound on the expected regret for our proposed algorithm, and then we prove a matching lower bound to establish the order-optimality of our algorithm. For the linearly parameterized setting, our algorithm achieves a regret upper bound that is of the same order as that of regular linear bandit algorithm called Optimism in the Face of Uncertainty Linear (OFUL) bandit algorithm, and unlike OFUL, our algorithm handles distortions and an arm-dependent noise model. For the best arm identification problem in the $K$-armed bandit setting, we propose algorithms, derive guarantees on their performance, and also show that these algorithms are order optimal by proving matching fundamental limits on performance. For best arm identification in linear bandits, we propose an algorithm and establish sample complexity guarantees. Finally, we present simulation experiments which demonstrate the advantages resulting from using distortion-aware learning algorithms in a vehicular traffic routing application.

LGJun 8, 2015
Cumulative Prospect Theory Meets Reinforcement Learning: Prediction and Control

Prashanth L. A., Cheng Jie, Michael Fu et al.

Cumulative prospect theory (CPT) is known to model human decisions well, with substantial empirical evidence supporting this claim. CPT works by distorting probabilities and is more general than the classic expected utility and coherent risk measures. We bring this idea to a risk-sensitive reinforcement learning (RL) setting and design algorithms for both estimation and control. The RL setting presents two particular challenges when CPT is applied: estimating the CPT objective requires estimations of the entire distribution of the value function and finding a randomized optimal policy. The estimation scheme that we propose uses the empirical distribution to estimate the CPT-value of a random variable. We then use this scheme in the inner loop of a CPT-value optimization procedure that is based on the well-known simulation optimization idea of simultaneous perturbation stochastic approximation (SPSA). We provide theoretical convergence guarantees for all the proposed algorithms and also illustrate the usefulness of CPT-based criteria in a traffic signal control application.

OCFeb 19, 2015
Adaptive system optimization using random directions stochastic approximation

Prashanth L. A., Shalabh Bhatnagar, Michael Fu et al.

We present novel algorithms for simulation optimization using random directions stochastic approximation (RDSA). These include first-order (gradient) as well as second-order (Newton) schemes. We incorporate both continuous-valued as well as discrete-valued perturbations into both our algorithms. The former are chosen to be independent and identically distributed (i.i.d.) symmetric, uniformly distributed random variables (r.v.), while the latter are i.i.d., asymmetric, Bernoulli r.v.s. Our Newton algorithm, with a novel Hessian estimation scheme, requires N-dimensional perturbations and three loss measurements per iteration, whereas the simultaneous perturbation Newton search algorithm of [1] requires 2N-dimensional perturbations and four loss measurements per iteration. We prove the unbiasedness of both gradient and Hessian estimates and asymptotic (strong) convergence for both first-order and second-order schemes. We also provide asymptotic normality results, which in particular establish that the asymmetric Bernoulli variant of Newton RDSA method is better than 2SPSA of [1]. Numerical experiments are used to validate the theoretical results.