Asynchronous Heavy-Tailed Optimization
This work targets asynchronous optimization under heavy-tailed gradient noise, a setting relevant to practitioners training transformer models, and offers incremental improvements over existing asynchronous methods.
The paper tackles heavy-tailed gradient noise, which destabilizes the optimization of transformer models, in asynchronous settings. It proposes two modifications: delay-aware learning rate scheduling and delay compensation. The reported results are convergence guarantees that match synchronous rates, improved delay tolerance over existing asynchronous approaches, and better accuracy/runtime trade-offs and hyperparameter robustness on image and language tasks.
Heavy-tailed stochastic gradient noise, commonly observed in transformer models, can destabilize the optimization process. Recent works mainly focus on developing and understanding approaches to address heavy-tailed noise in centralized or distributed synchronous settings, leaving the interactions between such noise and asynchronous optimization underexplored. In this work, we investigate two communication schemes that handle stragglers with asynchronous updates in the presence of heavy-tailed gradient noise. We propose and theoretically analyze algorithmic modifications based on delay-aware learning rate scheduling and delay compensation to enhance the performance of asynchronous algorithms. Our convergence guarantees under heavy-tailed noise match the rate of the synchronous counterparts and improve delay tolerance compared with existing asynchronous approaches. Empirically, our approaches outperform prior synchronous and asynchronous methods in terms of accuracy/runtime trade-offs and are more robust to hyperparameters in both image and language tasks.
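The abstract does not spell out the update rules, so the following is only a minimal sketch of what a delay-aware asynchronous step could look like, not the paper's algorithm. It rests on three assumptions: the step size shrinks as 1/(1 + delay), heavy-tailed noise is tamed by norm clipping, and delay compensation uses a first-order correction in the style of DC-ASGD. All function names and default values are hypothetical.

```python
import math


def delay_aware_lr(base_lr, delay):
    """Shrink the step size for staler gradients (assumed 1/(1+delay) rule)."""
    if delay < 0:
        raise ValueError("delay must be non-negative")
    return base_lr / (1.0 + delay)


def clip_by_norm(grad, max_norm):
    """Rescale grad so its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm or norm == 0.0:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


def compensate_delay(stale_grad, current_w, stale_w, lam):
    """Assumed DC-ASGD-style correction: g + lam * g * g * (w_now - w_stale)."""
    return [
        g + lam * g * g * (wc - ws)
        for g, wc, ws in zip(stale_grad, current_w, stale_w)
    ]


def async_step(w, stale_grad, stale_w, delay,
               base_lr=0.1, max_norm=1.0, lam=0.0):
    """Apply one server-side update using a gradient computed `delay` steps ago."""
    g = compensate_delay(stale_grad, w, stale_w, lam)  # correct for staleness
    g = clip_by_norm(g, max_norm)                      # bound heavy-tailed spikes
    lr = delay_aware_lr(base_lr, delay)                # smaller step if staler
    return [wi - lr * gi for wi, gi in zip(w, g)]


if __name__ == "__main__":
    w = [1.0, 1.0]
    # A worker reports a large gradient that was computed one step ago.
    print(async_step(w, stale_grad=[3.0, 4.0], stale_w=[1.0, 1.0], delay=1))
```

In this sketch the order of operations (compensate, then clip, then scale the step) is itself a design choice. Clipping last bounds the size of every update regardless of how large the compensated gradient becomes.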