MLFeb 16, 2021
The Elliptical Potential Lemma for General Distributions with an Application to Linear Thompson SamplingNima Hamidi, Mohsen Bayati
In this note, we introduce a general version of the well-known elliptical potential lemma that is a widely used technique in the analysis of algorithms in sequential learning and decision-making problems. We consider a stochastic linear bandit setting where a decision-maker sequentially chooses among a set of given actions, observes their noisy rewards, and aims to maximize her cumulative expected reward over a decision-making horizon. The elliptical potential lemma is a key tool for quantifying uncertainty in estimating parameters of the reward function, but it requires the noise and the prior distributions to be Gaussian. Our general elliptical potential lemma relaxes this Gaussian requirement which is a highly non-trivial extension for a number of reasons; unlike the Gaussian case, there is no closed-form solution for the covariance matrix of the posterior distribution, the covariance matrix is not a deterministic function of the actions, and the covariance matrix is not decreasing with respect to the semidefinite inequality. While this result is of broad interest, we showcase an application of it to prove an improved Bayesian regret bound for the well-known Thompson sampling algorithm in stochastic linear bandits with changing action sets where prior and noise distributions are general. This bound is minimax optimal up to constants.
LGJun 11, 2020
On Frequentist Regret of Linear Thompson SamplingNima Hamidi, Mohsen Bayati
This paper studies the stochastic linear bandit problem, where a decision-maker chooses actions from possibly time-dependent sets of vectors in $\mathbb{R}^d$ and receives noisy rewards. The objective is to minimize regret, the difference between the cumulative expected reward of the decision-maker and that of an oracle with access to the expected reward of each action, over a sequence of $T$ decisions. Linear Thompson Sampling (LinTS) is a popular Bayesian heuristic, supported by theoretical analysis that shows its Bayesian regret is bounded by $\widetilde{\mathcal{O}}(d\sqrt{T})$, matching minimax lower bounds. However, previous studies demonstrate that the frequentist regret bound for LinTS is $\widetilde{\mathcal{O}}(d\sqrt{dT})$, which requires posterior variance inflation and is by a factor of $\sqrt{d}$ worse than the best optimism-based algorithms. We prove that this inflation is fundamental and that the frequentist bound of $\widetilde{\mathcal{O}}(d\sqrt{dT})$ is the best possible, by demonstrating a randomization bias phenomenon in LinTS that can cause linear regret without inflation.We propose a data-driven version of LinTS that adjusts posterior inflation using observed data, which can achieve minimax optimal frequentist regret, under additional conditions. Our analysis provides new insights into LinTS and settles an open problem in the field.
LGFeb 24, 2020
The Unreasonable Effectiveness of Greedy Algorithms in Multi-Armed Bandit with Many ArmsMohsen Bayati, Nima Hamidi, Ramesh Johari et al.
We investigate a Bayesian $k$-armed bandit problem in the \emph{many-armed} regime, where $k \geq \sqrt{T}$ and $T$ represents the time horizon. Initially, and aligned with recent literature on many-armed bandit problems, we observe that subsampling plays a key role in designing optimal algorithms; the conventional UCB algorithm is sub-optimal, whereas a subsampled UCB (SS-UCB), which selects $Θ(\sqrt{T})$ arms for execution under the UCB framework, achieves rate-optimality. However, despite SS-UCB's theoretical promise of optimal regret, it empirically underperforms compared to a greedy algorithm that consistently chooses the empirically best arm. This observation extends to contextual settings through simulations with real-world data. Our findings suggest a new form of \emph{free exploration} beneficial to greedy algorithms in the many-armed context, fundamentally linked to a tail event concerning the prior distribution of arm rewards. This finding diverges from the notion of free exploration, which relates to covariate variation, as recently discussed in contextual bandit literature. Expanding upon these insights, we establish that the subsampled greedy approach not only achieves rate-optimality for Bernoulli bandits within the many-armed regime but also attains sublinear regret across broader distributions. Collectively, our research indicates that in the many-armed regime, practitioners might find greater value in adopting greedy algorithms.
LGFeb 12, 2020
A General Theory of the Stochastic Linear Bandit and Its ApplicationsNima Hamidi, Mohsen Bayati
Recent growing adoption of experimentation in practice has led to a surge of attention to multiarmed bandits as a technique to reduce the opportunity cost of online experiments. In this setting, a decision-maker sequentially chooses among a set of given actions, observes their noisy rewards, and aims to maximize her cumulative expected reward (or minimize regret) over a horizon of length $T$. In this paper, we introduce a general analysis framework and a family of algorithms for the stochastic linear bandit problem that includes well-known algorithms such as the optimism-in-the-face-of-uncertainty-linear-bandit (OFUL) and Thompson sampling (TS) as special cases. Our analysis technique bridges several streams of prior literature and yields a number of new results. First, our new notion of optimism in expectation gives rise to a new algorithm, called sieved greedy (SG) that reduces the overexploration problem in OFUL. SG utilizes the data to discard actions with relatively low uncertainty and then choosing one among the remaining actions greedily. In addition to proving that SG is theoretically rate optimal, our empirical simulations show that SG outperforms existing benchmarks such as greedy, OFUL, and TS. The second application of our general framework is (to the best of our knowledge) the first polylogarithmic (in $T$) regret bounds for OFUL and TS, under similar conditions as the ones by Goldenshluger and Zeevi (2013). Finally, we obtain sharper regret bounds for the $k$-armed contextual MABs by a factor of $\sqrt{k}$.
SDJun 16, 2019
Multi-scale Embedded CNN for Music Tagging (MsE-CNN)Nima Hamidi, Mohsen Vahidzadeh, Stephen Baek
Convolutional neural networks (CNN) recently gained notable attraction in a variety of machine learning tasks: including music classification and style tagging. In this work, we propose implementing intermediate connections to the CNN architecture to facilitate the transfer of multi-scale/level knowledge between different layers. Our novel model for music tagging shows significant improvement in comparison to the proposed approaches in the literature, due to its ability to carry low-level timbral features to the last layer.
LGApr 18, 2019
On Low-rank Trace Regression under General Sampling DistributionNima Hamidi, Mohsen Bayati
In this paper, we study the trace regression when a matrix of parameters B* is estimated via the convex relaxation of a rank-regularized regression or via regularized non-convex optimization. It is known that these estimators satisfy near-optimal error bounds under assumptions on the rank, coherence, and spikiness of B*. We start by introducing a general notion of spikiness for B* that provides a generic recipe to prove the restricted strong convexity of the sampling operator of the trace regression and obtain near-optimal and non-asymptotic error bounds for the estimation error. Similar to the existing literature, these results require the regularization parameter to be above a certain theory-inspired threshold that depends on observation noise that may be unknown in practice. Next, we extend the error bounds to cases where the regularization parameter is chosen via cross-validation. This result is significant in that existing theoretical results on cross-validated estimators (Kale et al., 2011; Kumar et al., 2013; Abou-Moustafa and Szepesvari, 2017) do not apply to our setting since the estimators we study are not known to satisfy their required notion of stability. Finally, using simulations on synthetic and real data, we show that the cross-validated estimator selects a near-optimal penalty parameter and outperforms the theory-inspired approach of selecting the parameter.