LGDec 11, 2024

From Logistic Regression to the Perceptron Algorithm: Exploring Gradient Descent with Large Step Sizes

arXiv:2412.08424v13 citationsh-index: 2
Originality Incremental advance
AI Analysis

This work addresses optimization theory gaps for classification problems, offering insights for practitioners but is incremental in method development.

The paper investigates logistic regression with gradient descent (LR+GD) using large step sizes on separable datasets, showing it reduces to a batch perceptron algorithm and that larger step sizes increase logistic loss but speed convergence, making loss an unreliable metric. It proposes Normalized LR+GD to improve theoretical guarantees.

We focus on the classification problem with a separable dataset, one of the most important and classical problems from machine learning. The standard approach to this task is logistic regression with gradient descent (LR+GD). Recent studies have observed that LR+GD can find a solution with arbitrarily large step sizes, defying conventional optimization theory. Our work investigates this phenomenon and makes three interconnected key observations about LR+GD with large step sizes. First, we find a remarkably simple explanation of why LR+GD with large step sizes solves the classification problem: LR+GD reduces to a batch version of the celebrated perceptron algorithm when the step size $γ\to \infty.$ Second, we observe that larger step sizes lead LR+GD to higher logistic losses when it tends to the perceptron algorithm, but larger step sizes also lead to faster convergence to a solution for the classification problem, meaning that logistic loss is an unreliable metric of the proximity to a solution. Surprisingly, high loss values can actually indicate faster convergence. Third, since the convergence rate in terms of loss function values of LR+GD is unreliable, we examine the iteration complexity required by LR+GD with large step sizes to solve the classification problem and prove that this complexity is suboptimal. To address this, we propose a new method, Normalized LR+GD - based on the connection between LR+GD and the perceptron algorithm - with much better theoretical guarantees.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes