Zhiming Huang

h-index30

3papers

5citations

Novelty55%

AI Score33

Ranked #116,024 of 194,257 authors (top 60%)#25,503 in LG (top 63%)

3 Papers

4.1LGMay 5, 2025

Connecting Thompson Sampling and UCB: Towards More Efficient Trade-offs Between Privacy and Regret

Bingshan Hu, Zhiming Huang, Tianyue H. Zhang et al. · mila

We address differentially private stochastic bandit problems from the angles of exploring the deep connections among Thompson Sampling with Gaussian priors, Gaussian mechanisms, and Gaussian differential privacy (GDP). We propose DP-TS-UCB, a novel parametrized private bandit algorithm that enables to trade off privacy and regret. DP-TS-UCB satisfies $ \tilde{O} \left(T^{0.25(1-α)}\right)$-GDP and enjoys an $O \left(K\ln^{α+1}(T)/Δ\right)$ regret bound, where $α\in [0,1]$ controls the trade-off between privacy and regret. Theoretically, our DP-TS-UCB relies on anti-concentration bounds of Gaussian distributions and links exploration mechanisms in Thompson Sampling-based algorithms and Upper Confidence Bound-based algorithms, which may be of independent interest.

2.6LGMay 2, 2024

Efficient and Adaptive Posterior Sampling Algorithms for Bandits

Bingshan Hu, Zhiming Huang, Tianyue H. Zhang et al. · mila

We study Thompson Sampling-based algorithms for stochastic bandits with bounded rewards. As the existing problem-dependent regret bound for Thompson Sampling with Gaussian priors [Agrawal and Goyal, 2017] is vacuous when $T \le 288 e^{64}$, we derive a more practical bound that tightens the coefficient of the leading term %from $288 e^{64}$ to $1270$. Additionally, motivated by large-scale real-world applications that require scalability, adaptive computational resource allocation, and a balance in utility and computation, we propose two parameterized Thompson Sampling-based algorithms: Thompson Sampling with Model Aggregation (TS-MA-$α$) and Thompson Sampling with Timestamp Duelling (TS-TD-$α$), where $α\in [0,1]$ controls the trade-off between utility and computation. Both algorithms achieve $O \left(K\ln^{α+1}(T)/Δ\right)$ regret bound, where $K$ is the number of arms, $T$ is the finite learning horizon, and $Δ$ denotes the single round performance loss when pulling a sub-optimal arm.

8.4LGFeb 16, 2021

Near-Optimal Algorithms for Differentially Private Online Learning in a Stochastic Environment

Bingshan Hu, Zhiming Huang, Nishant A. Mehta et al.

In this paper, we study differentially private online learning problems in a stochastic environment under both bandit and full information feedback. For differentially private stochastic bandits, we propose both UCB and Thompson Sampling-based algorithms that are anytime and achieve the optimal $O \left(\sum_{j: Δ_j>0} \frac{\ln(T)}{\min \left\{Δ_j, ε\right\}} \right)$ instance-dependent regret bound, where $T$ is the finite learning horizon, $Δ_j$ denotes the suboptimality gap between the optimal arm and a suboptimal arm $j$, and $ε$ is the required privacy parameter. For the differentially private full information setting with stochastic rewards, we show an $Ω\left(\frac{\ln(K)}{\min \left\{Δ_{\min}, ε\right\}} \right)$ instance-dependent regret lower bound and an $Ω\left(\sqrt{T\ln(K)} + \frac{\ln(K)}ε\right)$ minimax lower bound, where $K$ is the total number of actions and $Δ_{\min}$ denotes the minimum suboptimality gap among all the suboptimal actions. For the same differentially private full information setting, we also present an $ε$-differentially private algorithm whose instance-dependent regret and worst-case regret match our respective lower bounds up to an extra $\log(T)$ factor.