LG | Jun 16, 2021

Masked Training of Neural Networks with Partial Gradients

arXiv:2106.08895v3 | 29 citations
Originality: Incremental advance
AI Analysis

This work provides a theoretical foundation for improving the efficiency of training algorithms. The advance is incremental but useful for researchers in optimization and deep learning.

The authors tackled the lack of theoretical convergence analysis for SGD variants like Extragradient and Dropout by proposing a unified framework, and demonstrated its utility by jointly training a low-rank Transformer with a standard one to achieve superior performance.

State-of-the-art training algorithms for deep learning models are based on stochastic gradient descent (SGD). Recently, many variations have been explored: perturbing parameters for better accuracy (such as in Extragradient), limiting SGD updates to a subset of parameters for increased efficiency (such as meProp), or a combination of both (such as Dropout). However, the convergence of these methods is often not studied in theory. We propose a unified theoretical framework to study such SGD variants, encompassing the aforementioned algorithms as well as a broad variety of methods used for communication-efficient training or model compression. Our insights can be used as a guide to improve the efficiency of such methods and facilitate generalization to new applications. As an example, we tackle the task of jointly training networks, a version of which (limited to sub-networks) is used to create Slimmable Networks. By training a low-rank Transformer jointly with a standard one, we obtain better performance than when it is trained separately.
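
To make the two ingredients named in the abstract concrete (evaluating the gradient at perturbed parameters, and restricting the update to a subset of parameters), here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' algorithm: the quadratic objective, the random 0/1 mask, and the names `grad_quadratic`, `masked_sgd_step`, and `train` are all hypothetical choices made for this example.

```python
import random


def grad_quadratic(x, target):
    """Gradient of the toy objective f(x) = 0.5 * sum((x_i - target_i)^2)."""
    return [xi - ti for xi, ti in zip(x, target)]


def masked_sgd_step(x, target, mask, lr):
    """One masked step (hypothetical form, assumed for illustration).

    The gradient is evaluated at the masked point mask * x (a Dropout-like
    perturbation of the parameters), and only the coordinates selected by
    the mask are updated (a partial-gradient update).
    """
    x_masked = [xi * mi for xi, mi in zip(x, mask)]
    g = grad_quadratic(x_masked, target)
    return [xi - lr * mi * gi for xi, mi, gi in zip(x, mask, g)]


def train(steps=2000, keep_prob=0.5, lr=0.1, seed=0):
    """Run masked SGD on the toy objective with a fresh random mask per step."""
    rng = random.Random(seed)
    target = [1.0, -2.0, 0.5, 3.0]
    x = [0.0] * len(target)
    for _ in range(steps):
        mask = [1 if rng.random() < keep_prob else 0 for _ in target]
        x = masked_sgd_step(x, target, mask, lr)
    return x, target


if __name__ == "__main__":
    x, target = train()
    print("learned:", [round(v, 4) for v in x])
    print("target: ", target)
```

On this separable toy objective each coordinate simply waits until its mask entry is 1, so the iterates still approach the target. For coupled objectives such as real networks, masking changes the point at which the gradient is evaluated, which is the kind of situation a convergence analysis has to account for.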

