LG · OC · ML · May 26, 2019

Stochastic Gradient Methods with Block Diagonal Matrix Adaptation

arXiv:1905.10757v1 · 2 citations
Originality: Highly original
AI Analysis

This paper addresses inefficient optimization in deep learning and offers practitioners a practical solution with significant performance gains, though it is incremental in that it builds on existing adaptive gradient methods.

The paper tackles the computational burden of full-matrix adaptation in deep learning by proposing block-diagonal matrix adaptation, which improves convergence and generalization. It reports state-of-the-art results on several tasks, outperforming adaptive diagonal methods, vanilla SGD, and a recently proposed modified version of full-matrix adaptation.

Adaptive gradient approaches that automatically adjust the learning rate on a per-feature basis have been very popular for training deep networks. This rich class of algorithms includes Adagrad, RMSprop, Adam, and recent extensions. All these algorithms have adopted diagonal matrix adaptation, due to the prohibitive computational burden of manipulating full matrices in high dimensions. In this paper, we show that block-diagonal matrix adaptation can be a practical and powerful solution that can effectively utilize structural characteristics of deep learning architectures, and significantly improve convergence and out-of-sample generalization. We present a general framework with block-diagonal matrix updates via coordinate grouping, which includes counterparts of the aforementioned algorithms, and prove their convergence in non-convex optimization, highlighting benefits compared to diagonal versions. In addition, we propose an efficient spectrum-clipping scheme that benefits from the superior generalization performance of SGD. Extensive experiments reveal that block-diagonal approaches achieve state-of-the-art results on several deep learning tasks, and can outperform adaptive diagonal methods, vanilla SGD, as well as a modified version of full-matrix adaptation proposed very recently.
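To make the idea of block-diagonal adaptation via coordinate grouping more concrete, here is a minimal NumPy sketch of an Adagrad-style step. It is an illustration under stated assumptions, not the paper's algorithm: the function name `block_adagrad_step`, the `blocks` grouping, the inverse-square-root preconditioner, and the `clip` bounds are all hypothetical choices made for this example. In particular, the abstract does not spell out the spectrum-clipping scheme, so the sketch simply assumes that it bounds the eigenvalues of each block's preconditioner.

```python
import numpy as np


def block_adagrad_step(params, grad, accumulators, blocks,
                       lr=0.1, eps=1e-6, clip=None):
    """One Adagrad-style step with a block-diagonal preconditioner (sketch).

    params, grad : 1-D arrays of equal length
    accumulators : list of (r, r) arrays, one per block, updated in place
    blocks       : list of index arrays that partition the coordinates
                   (hypothetical grouping chosen by the caller)
    clip         : optional (low, high) bounds applied to the eigenvalues
                   of each block's preconditioner (assumed clipping rule)
    """
    new_params = params.copy()
    for idx, G in zip(blocks, accumulators):
        g = grad[idx]
        # Accumulate the outer product for this block only, so the cost
        # depends on the block size r rather than the full dimension.
        G += np.outer(g, g)
        # G is symmetric positive semi-definite, so eigh applies.
        w, V = np.linalg.eigh(G)
        # Eigenvalues of the preconditioner G^(-1/2), regularised by eps.
        scale = 1.0 / (np.sqrt(np.maximum(w, 0.0)) + eps)
        if clip is not None:
            # Bounding the spectrum keeps the step between two scaled
            # SGD steps; equal bounds would recover plain SGD.
            scale = np.clip(scale, clip[0], clip[1])
        new_params[idx] -= lr * (V @ (scale * (V.T @ g)))
    return new_params


# Usage: four coordinates grouped into two blocks of size two.
params = np.zeros(4)
grad = np.array([3.0, 4.0, 0.0, 2.0])
blocks = [np.array([0, 1]), np.array([2, 3])]
accumulators = [np.zeros((2, 2)) for _ in blocks]
params = block_adagrad_step(params, grad, accumulators, blocks)
```

With blocks of size one, this reduces to the familiar diagonal Adagrad update, which is one way to see the diagonal methods as a special case of the grouped framework.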

