LG · AI · Oct 27, 2025

Softmax is $1/2$-Lipschitz: A tight bound across all $\ell_p$ norms

arXiv:2510.23012v111 citations · h-index: 1 · Trans. Mach. Learn. Res.
Originality: Incremental advance
AI Analysis

The result provides a tighter theoretical guarantee for softmax, which is widely used in machine learning, and may tighten robustness and convergence analyses in applications such as attention mechanisms and reinforcement learning. The contribution is incremental in that it refines an existing bound.

The paper proves that the softmax function has a Lipschitz constant of 1/2 uniformly across all ℓ_p norms with p ≥ 1, improving on the constant of 1 typically assumed for the ℓ_2 norm, and demonstrates how this sharper bound strengthens existing theoretical results on robustness and convergence.

The softmax function is a basic operator in machine learning and optimization, used in classification, attention mechanisms, reinforcement learning, game theory, and problems involving log-sum-exp terms. Existing robustness guarantees of learning models and convergence analysis of optimization algorithms typically consider the softmax operator to have a Lipschitz constant of $1$ with respect to the $\ell_2$ norm. In this work, we prove that the softmax function is contractive with the Lipschitz constant $1/2$, uniformly across all $\ell_p$ norms with $p \ge 1$. We also show that the local Lipschitz constant of softmax attains $1/2$ for $p = 1$ and $p = \infty$, and for $p \in (1,\infty)$, the constant remains strictly below $1/2$ and the supremum $1/2$ is achieved only in the limit. To our knowledge, this is the first comprehensive norm-uniform analysis of softmax Lipschitz continuity. We demonstrate how the sharper constant directly improves a range of existing theoretical results on robustness and convergence. We further validate the sharpness of the $1/2$ Lipschitz constant of the softmax operator through empirical studies on attention-based architectures (ViT, GPT-2, Qwen3-8B) and on stochastic policies in reinforcement learning.
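The $1/2$ bound can be checked numerically. The following is a minimal sketch, not the paper's experimental code. It assumes the standard softmax Jacobian $J(x) = \mathrm{diag}(s) - s s^\top$ with $s = \mathrm{softmax}(x)$, and it measures the local Lipschitz constant in the $\ell_p$ norm as the induced matrix norm of $J(x)$. The function names (`softmax_jacobian`, `induced_norms`, `max_difference_ratio`) and the sampling scales are illustrative choices, not taken from the paper.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax of a 1-D array."""
    z = x - np.max(x)  # shifting by the max does not change the output
    e = np.exp(z)
    return e / e.sum()


def softmax_jacobian(x):
    """Jacobian of softmax at x: diag(s) - s s^T, with s = softmax(x)."""
    s = softmax(x)
    return np.diag(s) - np.outer(s, s)


def induced_norms(x):
    """Induced l_1, l_2 and l_inf operator norms of the Jacobian at x.

    For a 2-D array, np.linalg.norm with ord=1 is the maximum column sum,
    ord=2 the largest singular value, and ord=np.inf the maximum row sum.
    """
    J = softmax_jacobian(x)
    return {p: np.linalg.norm(J, ord=p) for p in (1, 2, np.inf)}


def max_difference_ratio(n=5, trials=2000, p=2, seed=0):
    """Largest observed ||softmax(x) - softmax(y)||_p / ||x - y||_p.

    Random pairs can only lower-bound the Lipschitz constant, so this is
    a sanity check of the bound, not a proof of it.
    """
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(trials):
        x = rng.normal(scale=3.0, size=n)      # hypothetical logit scale
        y = x + rng.normal(scale=1.0, size=n)  # hypothetical perturbation
        num = np.linalg.norm(softmax(x) - softmax(y), ord=p)
        den = np.linalg.norm(x - y, ord=p)
        worst = max(worst, num / den)
    return worst


if __name__ == "__main__":
    # Logits chosen so that softmax(x) = (1/2, 1/4, 1/4).
    x = np.array([np.log(2.0), 0.0, 0.0])
    print(induced_norms(x))
    for p in (1, 2, np.inf):
        print(p, max_difference_ratio(p=p))
```

For the logits $(\ln 2, 0, 0)$ the softmax output is $(1/2, 1/4, 1/4)$, and the Jacobian's induced $\ell_1$ and $\ell_\infty$ norms both equal $1/2$, while its $\ell_2$ norm is $3/8$. This matches the statement that the local constant attains $1/2$ for $p = 1$ and $p = \infty$. The sampled ratios from `max_difference_ratio` should stay at or below $1/2$ for each norm.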

