Minimax Optimal Convergence of Gradient Descent in Logistic Regression via Large and Adaptive Stepsizes
This provides theoretical guarantees for gradient descent in logistic regression, addressing convergence issues for practitioners in machine learning, though it is incremental as it builds on existing methods like the Perceptron.
The paper tackles the problem of gradient descent for logistic regression on linearly separable data by showing that with large, adaptive stepsizes, it achieves an arbitrarily small risk after at most 1/γ² burn-in steps, where γ is the dataset margin, and proves this is minimax optimal among first-order batch methods.
We study $\textit{gradient descent}$ (GD) for logistic regression on linearly separable data with stepsizes that adapt to the current risk, scaled by a constant hyperparameter $η$. We show that after at most $1/γ^2$ burn-in steps, GD achieves a risk upper bounded by $\exp(-Θ(η))$, where $γ$ is the margin of the dataset. As $η$ can be arbitrarily large, GD attains an arbitrarily small risk $\textit{immediately after the burn-in steps}$, though the risk evolution may be $\textit{non-monotonic}$. We further construct hard datasets with margin $γ$, where any batch (or online) first-order method requires $Ω(1/γ^2)$ steps to find a linear separator. Thus, GD with large, adaptive stepsizes is $\textit{minimax optimal}$ among first-order batch methods. Notably, the classical $\textit{Perceptron}$ (Novikoff, 1962), a first-order online method, also achieves a step complexity of $1/γ^2$, matching GD even in constants. Finally, our GD analysis extends to a broad class of loss functions and certain two-layer networks.