LG Sep 5, 2023

Asymmetric Momentum: A Rethinking of Gradient Descent

Gongyue Zhang, Dinghuang Zhang, Shuwen Zhao, Donghan Liu, Carrie M. Toptan, Honghai Liu

arXiv:2309.02130v22.01 citations h-index: 6 Has Code

Originality: Incremental advance

AI Analysis

This work addresses optimization efficiency for machine learning practitioners by offering a method intended to adapt to both sparse and non-sparse gradients, though its rethinking of momentum strategies is an incremental advance.

The paper tackles the problem of gradient descent optimization by proposing Loss-Controlled Asymmetric Momentum (LCAM), which accelerates different parameters based on loss phases and gradient sparsity, achieving equal or better test accuracy on Cifar10 and Cifar100 with nearly half the training epochs compared to SGD.

Through theoretical and experimental validation, we propose Loss-Controlled Asymmetric Momentum (LCAM), the simplest SGD-enhanced method. Unlike existing adaptive methods such as Adam, which penalize frequently-changing parameters and are only applicable to sparse gradients, LCAM averages the loss to divide the training process into different loss phases and uses a different momentum in each. It can not only accelerate slow-changing parameters for sparse gradients, as adaptive optimizers do, but can also accelerate frequently-changing parameters for non-sparse gradients, making it adaptable to all types of datasets. We reinterpret the machine learning training process through the concepts of weight coupling and weight traction, and experimentally validate that weights have directional specificity, which is correlated with the specificity of the dataset. Interestingly, we observe that for non-sparse gradients, frequently-changing parameters should actually be accelerated, which is the complete opposite of the traditional adaptive perspective. Compared to traditional SGD with momentum, this algorithm separates the weights without additional computational cost. It is noteworthy that this method relies on the network's ability to extract complex features. We primarily use Wide Residual Networks (WRN) for our research, employing the classic datasets Cifar10 and Cifar100 to test the ability for feature separation, and we draw conclusions about phenomena that are much more important than accuracy rates alone. Finally, compared to classic SGD tuning methods, using WRN on these two datasets with nearly half the training epochs, we achieve equal or better test accuracy.
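The loss-phase idea in the abstract can be illustrated with a rough sketch. The abstract does not give the exact switching rule, so everything specific here is an assumption made for illustration: the class name `LossPhaseMomentum`, the function `sgd_momentum_step`, the running mean over all losses seen so far, the two momentum values, and the choice of which phase receives the larger momentum. It should not be read as the authors' implementation.

```python
class LossPhaseMomentum:
    """Hypothetical selector: chooses a momentum value from the loss phase.

    Assumption: a "loss phase" is decided by comparing the current loss
    with the running mean of all losses seen so far.
    """

    def __init__(self, momentum_above=0.95, momentum_below=0.90):
        # Both values, and which phase gets which, are illustrative guesses.
        self.momentum_above = momentum_above  # used when loss >= running mean
        self.momentum_below = momentum_below  # used when loss < running mean
        self._loss_sum = 0.0
        self._count = 0

    def update(self, loss):
        """Record a loss value and return the momentum for the next step."""
        self._loss_sum += loss
        self._count += 1
        mean_loss = self._loss_sum / self._count
        if loss >= mean_loss:
            return self.momentum_above
        return self.momentum_below


def sgd_momentum_step(params, grads, velocity, lr, momentum):
    """One plain SGD-with-momentum step on lists of floats."""
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    new_params = [p - lr * v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity
```

In a training loop, `update` would be called with each observed loss and its return value passed as `momentum` to the step function. The only extra state is a running sum and a counter, which is in keeping with the abstract's claim of no additional computational cost over SGD with momentum.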
