LG · Jun 5

Uniform Stability and Generalization Error of GD and SGD on Fixed-Point Parameters

arXiv:2606.06934 · 15.7

Originality: Incremental advance

AI Analysis

For machine learning practitioners using low-precision training, this work reveals that quantization can fundamentally alter generalization behavior, with implications for model deployment on resource-constrained hardware.

The paper analyzes the generalization error and stability of gradient descent (GD) and stochastic gradient descent (SGD) when parameters are constrained to fixed-point (discrete) representations. For convex, Lipschitz, and smooth losses, deterministic rounding degrades GD's generalization error from O(T/n) to O(T/√n) and drives its uniform stability to Ω(T), which makes stability-based generalization bounds vacuous. SGD with deterministic rounding retains nontrivial uniform stability, with tight bounds of O(T/n) in one dimension and O(T²/n) in higher dimensions.

We analyze generalization error, uniform stability, and uniform argument stability of gradient descent (GD) and stochastic gradient descent (SGD) over discrete parameter spaces, where each update involves deterministic or stochastic rounding. We show that deterministic rounding degrades the generalization error of GD on convex, Lipschitz, and smooth loss functions, increasing the rate from $O(T/n)$ to $O(T/\sqrt{n})$, and establish matching lower bounds. We further prove that uniform stability of GD becomes $\Omega(T)$, showing that stability-based generalization bounds are vacuous in this setting. In contrast, for the same losses, stochastic gradient descent with deterministic rounding admits nontrivial uniform stability guarantees, which differ qualitatively from the real-valued case and exhibit distinct dependencies on the number of iterations and the dimension: we prove tight bounds $O(T/n)$ for one dimension and $O(T^2/n)$ for higher dimensions. We also show that stochastic rounding can introduce generalization error that increases with the dimension; such a phenomenon is absent in standard real-valued optimization and in the deterministic rounding case. Finally, we provide upper bounds on uniform argument stability for stochastic rounding schemes and show that these bounds are tight when the loss can be represented as a sum of coordinate-wise functions.
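As a rough illustration of the setting the abstract describes, the following Python sketch runs gradient descent where every update is rounded back onto a uniform grid. The abstract does not specify the rounding operators, so everything here is an assumption made for illustration: the grid spacing `delta`, round-to-nearest as the deterministic scheme, the unbiased two-neighbour rule as the stochastic scheme, and the function names `round_deterministic`, `round_stochastic`, and `quantized_gd`. It is not the paper's algorithm.

```python
import math
import random


def round_deterministic(x, delta):
    """Round x to the nearest multiple of delta (assumed round-to-nearest scheme)."""
    return delta * math.floor(x / delta + 0.5)


def round_stochastic(x, delta, rng):
    """Round x to one of its two neighbouring grid points.

    The upper neighbour is chosen with probability equal to the fractional
    position of x between the neighbours, so the result equals x in expectation.
    """
    lower = math.floor(x / delta)
    frac = x / delta - lower
    return delta * (lower + (1 if rng.random() < frac else 0))


def quantized_gd(grad, w0, lr, steps, delta, rounding="deterministic", seed=0):
    """Gradient descent with each coordinate rounded onto the grid after every step.

    grad: hypothetical callable mapping a list of floats to its gradient (a list).
    """
    rng = random.Random(seed)
    w = list(w0)
    for _ in range(steps):
        g = grad(w)
        for i in range(len(w)):
            u = w[i] - lr * g[i]  # real-valued update before rounding
            if rounding == "deterministic":
                w[i] = round_deterministic(u, delta)
            else:
                w[i] = round_stochastic(u, delta, rng)
    return w


# Toy loss f(w) = 0.5 * (w - 0.3)^2, chosen only for illustration.
def toy_grad(w):
    return [w[0] - 0.3]


w_det = quantized_gd(toy_grad, [1.0], lr=0.1, steps=50, delta=0.25)
w_sto = quantized_gd(toy_grad, [1.0], lr=0.1, steps=50, delta=0.25,
                     rounding="stochastic", seed=1)
```

In this toy run the deterministic iterate never leaves its starting grid point, because each step moves it by less than half the grid spacing and is rounded straight back. The stochastic iterate can move, at the cost of extra randomness in the trajectory. The example only shows how the two rounding rules behave mechanically and says nothing about the generalization rates stated above.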

