Accelerated Gradient Tracking over Time-varying Graphs for Decentralized Optimization
It addresses decentralized optimization in federated learning on mobile devices, offering incremental improvements in convergence efficiency for time-varying networks.
This paper extends accelerated gradient tracking to time-varying graphs for decentralized optimization, proving improved convergence rates of O((γ/(1-σ_γ))^2√(L/ε)) for non-strongly convex problems and O((γ/(1-σ_γ))^1.5√(L/μ)log(1/ε)) for strongly convex problems, which are significantly better than previous rates for static graphs.
Decentralized optimization over time-varying graphs has been increasingly common in modern machine learning with massive data stored on millions of mobile devices, such as in federated learning. This paper revisits the widely used accelerated gradient tracking and extends it to time-varying graphs. We prove that the practical single loop accelerated gradient tracking needs $O((\fracγ{1-σ_γ})^2\sqrt{\frac{L}ε})$ and $O((\fracγ{1-σ_γ})^{1.5}\sqrt{\frac{L}μ}\log\frac{1}ε)$ iterations to reach an $ε$-optimal solution over time-varying graphs when the problems are nonstrongly convex and strongly convex, respectively, where $γ$ and $σ_γ$ are two common constants charactering the network connectivity, $L$ and $μ$ are the smoothness and strong convexity constants, respectively, and one iteration corresponds to one gradient oracle call and one communication round. Our convergence rates improve significantly over the ones of $O(\frac{1}{ε^{5/7}})$ and $O((\frac{L}μ)^{5/7}\frac{1}{(1-σ)^{1.5}}\log\frac{1}ε)$, respectively, which were proved in the original literature of accelerated gradient tracking only for static graphs, where $\fracγ{1-σ_γ}$ equals $\frac{1}{1-σ}$ when the network is time-invariant. When combining with a multiple consensus subroutine, the dependence on the network connectivity constants can be further improved to $O(1)$ and $O(\fracγ{1-σ_γ})$ for the gradient oracle and communication round complexities, respectively. When the network is static, by employing the Chebyshev acceleration, our complexities exactly match the lower bounds without hiding any poly-logarithmic factor for both nonstrongly convex and strongly convex problems.