A Sharp Estimate on the Transient Time of Distributed Stochastic Gradient Descent
This provides a precise analysis of convergence speed for distributed optimization, which is incremental but important for applications in networked systems like federated learning.
The paper tackles the problem of minimizing the average of n cost functions over a network using distributed stochastic gradient descent (DSGD) with noisy gradients, showing that DSGD asymptotically achieves the optimal network-independent convergence rate for strongly convex and smooth functions. The main result is characterizing the transient time to approach this rate as O(n/(1-ρ_w)^2), with a matching lower bound proving sharpness.
This paper is concerned with minimizing the average of $n$ cost functions over a network in which agents may communicate and exchange information with each other. We consider the setting where only noisy gradient information is available. To solve the problem, we study the distributed stochastic gradient descent (DSGD) method and perform a non-asymptotic convergence analysis. For strongly convex and smooth objective functions, DSGD asymptotically achieves the optimal network independent convergence rate compared to centralized stochastic gradient descent (SGD). Our main contribution is to characterize the transient time needed for DSGD to approach the asymptotic convergence rate, which we show behaves as $K_T=\mathcal{O}\left(\frac{n}{(1-ρ_w)^2}\right)$, where $1-ρ_w$ denotes the spectral gap of the mixing matrix. Moreover, we construct a "hard" optimization problem for which we show the transient time needed for DSGD to approach the asymptotic convergence rate is lower bounded by $Ω\left(\frac{n}{(1-ρ_w)^2} \right)$, implying the sharpness of the obtained result. Numerical experiments demonstrate the tightness of the theoretical results.