Structured and Fast Optimization: The Kronecker SGD Algorithm
This work addresses a critical bottleneck in training large-scale deep learning models and offers a potential speedup for applications such as natural language processing and computer vision. It appears incremental, however, because it builds on existing SGD frameworks and adds structural assumptions on the inputs.
The paper tackles the cost of running stochastic gradient descent (SGD) on deep learning models with a large parameter size $d$. It introduces an algorithm that exploits Kronecker product structure in the input data, so that the per-iteration computational cost scales sublinearly with $d$. For a two-layer neural network, the per-iteration cost is independent of $d$.
Stochastic gradient descent (SGD) is a fundamental optimization method in modern machine learning. Meanwhile, deep learning architectures have shown outstanding performance in a wide range of fields, such as natural language processing, bioinformatics, and computer vision. Nevertheless, as the parameter size $d$ increases, these models encounter serious efficiency challenges. Previous studies show that the per-iteration computational cost scales linearly with the parameter size $d$. To mitigate this, our paper exploits inherent structure in the training examples, such as Kronecker products. We consider input data points that can be represented as tensor products of lower-dimensional vectors. We introduce a novel stochastic optimization method whose per-iteration computational cost scales sublinearly with $d$, under mild structural assumptions on the inputs. To our knowledge, this is the first work to achieve this result, and we believe it represents a significant step forward for efficient deep learning optimization. We support these findings with a formal theorem showing that the proposed algorithm can train a two-layer fully connected neural network with a per-iteration cost independent of $d$.
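To illustrate why Kronecker structure can reduce cost, the following minimal sketch checks a standard identity, $\langle a \otimes b, c \otimes e \rangle = \langle a, c \rangle \langle b, e \rangle$, which lets an inner product between two Kronecker-structured vectors of dimension $n^2$ be computed from $2n$ entries. This is not the paper's algorithm, only one well-known property that structured inputs make available. The helper names (`kron`, `dot`, `kron_dot`) and the factor dimension `n = 32` are assumptions chosen for illustration.

```python
import random


def kron(a, b):
    """Explicit Kronecker product of two vectors; length is len(a) * len(b)."""
    return [ai * bj for ai in a for bj in b]


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def kron_dot(a, b, c, e):
    """Compute <a (x) b, c (x) e> as <a, c> * <b, e> without forming either product."""
    return dot(a, c) * dot(b, e)


random.seed(0)
n = 32  # hypothetical factor dimension; the full vectors have n * n = 1024 entries
a, b, c, e = ([random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(4))

naive = dot(kron(a, b), kron(c, e))  # builds and scans both full-length vectors
fast = kron_dot(a, b, c, e)          # scans only the four length-n factors

# The two results agree up to floating-point rounding.
print(abs(naive - fast) <= 1e-9 * max(1.0, abs(naive)))
```

The explicit route does work proportional to $n^2$, while the factored route does work proportional to $n$, which is the kind of gap a structure-aware update can exploit when its assumptions on the inputs hold.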