LG · ML · Aug 22, 2018

Don't Use Large Mini-Batches, Use Local SGD

arXiv:1808.07217v6 · 466 citations
Originality: Incremental advance
AI Analysis

This paper addresses a major roadblock in scalable deep learning, offering practitioners a way to maintain generalization without sacrificing training speed.

The paper tackles the problem of poor generalization in distributed deep learning when using large mini-batches, proposing post-local SGD to improve accuracy on new data while maintaining efficiency and scalability, with significant gains shown on standard benchmarks.

Mini-batch stochastic gradient methods (SGD) are state of the art for distributed training of deep neural networks. Drastic increases in the mini-batch sizes have led to key efficiency and scalability gains in recent years. However, progress faces a major roadblock, as models trained with large batches often do not generalize well, i.e., they do not show good accuracy on new data. As a remedy, we propose a post-local SGD and show that it significantly improves the generalization performance compared to large-batch training on standard benchmarks while enjoying the same efficiency (time-to-accuracy) and scalability. We further provide an extensive study of the communication efficiency vs. performance trade-offs associated with a host of local SGD variants.
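The abstract names local SGD and post-local SGD without spelling out the update rule. As a rough illustration only, the sketch below assumes the common description of these methods: in local SGD each worker takes several SGD steps on its own data before the model replicas are averaged, and post-local SGD starts with a phase that averages after every step (equivalent to synchronized mini-batch SGD) and switches to local SGD afterwards. The toy 1-D regression problem, the function names (`make_shards`, `grad`, `post_local_sgd`), and all hyperparameters are hypothetical and are not taken from the paper or its code.

```python
import random


def make_shards(num_workers, points_per_worker, true_w, seed=0):
    """Toy data (assumption): each worker holds (x, y) pairs with y ~ true_w * x."""
    rng = random.Random(seed)
    shards = []
    for _ in range(num_workers):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(points_per_worker)]
        shards.append([(x, true_w * x + rng.gauss(0.0, 0.01)) for x in xs])
    return shards


def grad(w, batch):
    """Gradient of the mean of 0.5 * (w * x - y)^2 over a mini-batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def post_local_sgd(shards, total_steps, switch_step, local_steps,
                   lr=0.2, batch_size=4, seed=0):
    """Return (averaged weight, number of synchronization rounds)."""
    rng = random.Random(seed)
    workers = [0.0] * len(shards)  # one model replica per worker
    sync_rounds = 0
    steps_since_sync = 0
    for t in range(total_steps):
        # Phase 1: average after every step (plain synchronized mini-batch SGD).
        # Phase 2: average only after `local_steps` local updates (local SGD).
        h = 1 if t < switch_step else local_steps
        workers = [
            w - lr * grad(w, rng.sample(shard, batch_size))
            for w, shard in zip(workers, shards)
        ]
        steps_since_sync += 1
        if steps_since_sync >= h:
            avg = sum(workers) / len(workers)
            workers = [avg] * len(workers)  # all replicas restart from the average
            sync_rounds += 1
            steps_since_sync = 0
    return sum(workers) / len(workers), sync_rounds


shards = make_shards(num_workers=4, points_per_worker=32, true_w=3.0, seed=1)
w, rounds = post_local_sgd(shards, total_steps=200, switch_step=100, local_steps=4)
```

In this toy run the first 100 steps each trigger an averaging round and the remaining 100 steps trigger one every 4 steps, so the workers communicate 125 times instead of the 200 that fully synchronized training would need. That reduction in communication is the efficiency side of the trade-off the abstract refers to; the generalization benefit it reports cannot be reproduced on a problem this small.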

Code Implementations: 2 repos