LG, OC, ML · May 23, 2019

Blockwise Adaptivity: Faster Training and Better Generalization in Deep Learning

arXiv:1905.09899v1 · 6 citations
Originality: Incremental advance
AI Analysis

This paper addresses the poor generalization of adaptive optimizers in deep learning, offering practitioners a method that balances training speed and test accuracy. It is an incremental advance because it builds on existing adaptive techniques.

The paper tackles a known weakness of adaptive optimization methods in deep learning: they converge quickly but can generalize poorly. It proposes blockwise adaptive gradient descent, which splits the parameters into blocks and uses one adaptive stepsize per block instead of one per coordinate. The result is a convergence rate comparable to the coordinate-wise counterpart but faster up to a constant factor, along with improved generalization over Adam and Nesterov's accelerated gradient.

Stochastic methods with coordinate-wise adaptive stepsize (such as RMSprop and Adam) have been widely used in training deep neural networks. Despite their fast convergence, they can generalize worse than stochastic gradient descent. In this paper, by revisiting the design of Adagrad, we propose to split the network parameters into blocks and use a blockwise adaptive stepsize. Intuitively, blockwise adaptivity is less aggressive than adaptivity to individual coordinates, and can strike a better balance between adaptivity and generalization. We show theoretically that the proposed blockwise adaptive gradient descent has a convergence rate comparable to that of its counterpart with coordinate-wise adaptive stepsize, but is faster up to some constant. We also study its uniform stability and show that blockwise adaptivity can lead to lower generalization error than coordinate-wise adaptivity. Experimental results show that blockwise adaptive gradient descent converges faster and improves generalization performance over Nesterov's accelerated gradient and Adam.
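
To make the blockwise idea concrete, here is a minimal NumPy sketch of an Adagrad-style update that keeps one scalar accumulator per parameter block rather than one per coordinate. It is an illustration under stated assumptions, not the paper's exact algorithm: the function name `blockwise_adagrad_step`, the use of the mean squared gradient as the per-block statistic, and the choice of one block per array are all hypothetical.

```python
import numpy as np

def blockwise_adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """One Adagrad-style step with a single adaptive stepsize per block.

    params, grads: lists of arrays, one array per block (assumed partition).
    accum: list of floats, the running per-block gradient statistic.
    Returns the updated parameter list and accumulator list.
    """
    new_params, new_accum = [], []
    for w, g, v in zip(params, grads, accum):
        # Assumption: summarize the block by its mean squared gradient.
        # Coordinate-wise Adagrad would instead accumulate g ** 2 elementwise.
        v = v + float(np.mean(g ** 2))
        # Every coordinate in the block shares the same stepsize lr / sqrt(v).
        new_params.append(w - lr * g / (np.sqrt(v) + eps))
        new_accum.append(v)
    return new_params, new_accum

if __name__ == "__main__":
    # Toy problem: minimize 0.5 * ||w||^2 over two blocks, so the gradient is w.
    params = [np.array([1.0, -2.0]), np.array([[0.5, 0.5], [3.0, -1.0]])]
    accum = [0.0 for _ in params]
    for _ in range(200):
        grads = [w.copy() for w in params]
        params, accum = blockwise_adagrad_step(params, grads, accum)
    print([float(np.linalg.norm(w)) for w in params])
```

Because the accumulator is a scalar per block, the update direction within a block stays parallel to the raw gradient; only the relative scaling across blocks is adapted. This is one way to read the abstract's remark that blockwise adaptivity is "less aggressive" than per-coordinate adaptivity.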

