LG DS MLOct 9, 2021

Does Preprocessing Help Training Over-parameterized Neural Networks?

arXiv:2110.04622v117.252 citations

Originality Highly original

AI Analysis

This work addresses the fundamental challenge of efficient training for deep neural networks, offering theoretical speedups that could benefit machine learning practitioners dealing with large-scale models.

The paper tackles the problem of reducing the computational cost of training over-parameterized neural networks by proposing two preprocessing methods: one on initial weights to achieve cost per iteration of O~(m^(1-Θ(1/d)) n d) and another on input data to achieve O~(m^(4/5) n d), bypassing the classical Ω(m n d) barrier.

Deep neural networks have achieved impressive performance in many areas. Designing a fast and provable method for training neural networks is a fundamental question in machine learning. The classical training method requires paying $Ω(mnd)$ cost for both forward computation and backward computation, where $m$ is the width of the neural network, and we are given $n$ training points in $d$-dimensional space. In this paper, we propose two novel preprocessing ideas to bypass this $Ω(mnd)$ barrier: $\bullet$ First, by preprocessing the initial weights of the neural networks, we can train the neural network in $\widetilde{O}(m^{1-Θ(1/d)} n d)$ cost per iteration. $\bullet$ Second, by preprocessing the input data points, we can train the neural network in $\widetilde{O} (m^{4/5} nd )$ cost per iteration. From the technical perspective, our result is a sophisticated combination of tools in different fields, greedy-type convergence analysis in optimization, sparsity observation in practical work, high-dimensional geometric search in data structure, concentration and anti-concentration in probability. Our results also provide theoretical insights for a large number of previously established fast training methods. In addition, our classical algorithm can be generalized to the Quantum computation model. Interestingly, we can get a similar sublinear cost per iteration but avoid preprocessing initial weights or input data points.

View on arXiv PDF

Similar