Gradient Descent-Ascent Provably Converges to Strict Local Minmax Equilibria with a Finite Timescale Separation
This work addresses convergence issues in adversarial training for machine learning, providing theoretical guarantees that bridge a gap between existing edge cases, though it is incremental as it builds on prior analyses.
The paper tackles the problem of convergence in two-player non-convex, non-concave zero-sum games using gradient descent-ascent, showing that a finite timescale separation parameter ensures convergence to strict local minmax equilibria, with explicit construction for computing this parameter and empirical validation on datasets like CIFAR-10 and CelebA.
We study the role that a finite timescale separation parameter $τ$ has on gradient descent-ascent in two-player non-convex, non-concave zero-sum games where the learning rate of player 1 is denoted by $γ_1$ and the learning rate of player 2 is defined to be $γ_2=τγ_1$. Existing work analyzing the role of timescale separation in gradient descent-ascent has primarily focused on the edge cases of players sharing a learning rate ($τ=1$) and the maximizing player approximately converging between each update of the minimizing player ($τ\rightarrow \infty$). For the parameter choice of $τ=1$, it is known that the learning dynamics are not guaranteed to converge to a game-theoretically meaningful equilibria in general. In contrast, Jin et al. (2020) showed that the stable critical points of gradient descent-ascent coincide with the set of strict local minmax equilibria as $τ\rightarrow\infty$. In this work, we bridge the gap between past work by showing there exists a finite timescale separation parameter $τ^{\ast}$ such that $x^{\ast}$ is a stable critical point of gradient descent-ascent for all $τ\in (τ^{\ast}, \infty)$ if and only if it is a strict local minmax equilibrium. Moreover, we provide an explicit construction for computing $τ^{\ast}$ along with corresponding convergence rates and results under deterministic and stochastic gradient feedback. The convergence results we present are complemented by a non-convergence result: given a critical point $x^{\ast}$ that is not a strict local minmax equilibrium, then there exists a finite timescale separation $τ_0$ such that $x^{\ast}$ is unstable for all $τ\in (τ_0, \infty)$. Finally, we empirically demonstrate on the CIFAR-10 and CelebA datasets the significant impact timescale separation has on training performance.