LG OCMay 29, 2022

Stochastic Zeroth Order Gradient and Hessian Estimators: Variance Reduction and Refined Bias Bounds

arXiv:2205.14737v35.817 citationsh-index: 4Has Code

Originality Incremental advance

AI Analysis

This addresses variance reduction in zeroth-order optimization methods, which is crucial for applications like black-box optimization where gradient information is unavailable, but it is incremental as it builds on existing finite-difference estimators.

The paper tackles the problem of high variance in stochastic zeroth-order gradient and Hessian estimators for functions in ℝⁿ by using random orthogonal directions, showing that with k directions and granularity δ, gradient variance is bounded by O((n/k - 1) + (n²/k - n)δ² + n²δ⁴/k) and Hessian variance by O((n²/k² - 1) + (n⁴/k² - n²)δ² + n⁴δ⁴/k²), with negligible variance when k=n, and provides improved bias bounds of O(δ²Γ).

We study stochastic zeroth order gradient and Hessian estimators for real-valued functions in $\mathbb{R}^n$. We show that, via taking finite difference along random orthogonal directions, the variance of the stochastic finite difference estimators can be significantly reduced. In particular, we design estimators for smooth functions such that, if one uses $ Θ\left( k \right) $ random directions sampled from the Stiefel's manifold $ \text{St} (n,k) $ and finite-difference granularity $δ$, the variance of the gradient estimator is bounded by $ \mathcal{O} \left( \left( \frac{n}{k} - 1 \right) + \left( \frac{n^2}{k} - n \right) δ^2 + \frac{ n^2 δ^4 }{ k } \right) $, and the variance of the Hessian estimator is bounded by $\mathcal{O} \left( \left( \frac{n^2}{k^2} - 1 \right) + \left( \frac{n^4}{k^2} - n^2 \right) δ^2 + \frac{n^4 δ^4 }{k^2} \right) $. When $k = n$, the variances become negligibly small. In addition, we provide improved bias bounds for the estimators. The bias of both gradient and Hessian estimators for smooth function $f$ is of order $\mathcal{O} \left( δ^2 Γ\right)$, where $δ$ is the finite-difference granularity, and $ Γ$ depends on high order derivatives of $f$. Our results are evidenced by empirical observations.

View on arXiv PDF Code

Similar