Thermodynamic Natural Gradient Descent
The work addresses a hardware limitation in large-scale neural network training and offers a potential speed-up for practitioners, though the contribution is incremental: it builds on existing NGD concepts and adds a novel hardware implementation.
The paper tackles the computational overhead of second-order training methods such as natural gradient descent (NGD) by proposing a hybrid digital-analog algorithm that exploits the thermodynamic properties of an analog system to achieve a per-iteration complexity similar to that of first-order methods. Numerical results show it outperforming state-of-the-art digital first- and second-order methods on classification and language model fine-tuning tasks.
Second-order training methods have better convergence properties than gradient descent but are rarely used in practice for large-scale training due to their computational overhead. This can be viewed as a hardware limitation (imposed by digital computers). Here we show that natural gradient descent (NGD), a second-order method, can have a similar computational complexity per iteration to a first-order method, when employing appropriate hardware. We present a new hybrid digital-analog algorithm for training neural networks that is equivalent to NGD in a certain parameter regime but avoids prohibitively costly linear system solves. Our algorithm exploits the thermodynamic properties of an analog system at equilibrium, and hence requires an analog thermodynamic computer. The training occurs in a hybrid digital-analog loop, where the gradient and Fisher information matrix (or any other positive semi-definite curvature matrix) are calculated at given time intervals while the analog dynamics take place. We numerically demonstrate the superiority of this approach over state-of-the-art digital first- and second-order training methods on classification tasks and language model fine-tuning tasks.
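The idea of replacing a linear solve with equilibrium dynamics can be illustrated with a small digital simulation. The sketch below is not the paper's algorithm or hardware model. It assumes the analog system behaves like an overdamped Langevin (Ornstein-Uhlenbeck) process whose stationary mean is the solution of A x = g, and it emulates that process with Euler-Maruyama steps. The function name `thermo_solve`, the damping term `lam`, the inverse temperature `beta`, the step size, and the toy Fisher matrix are all assumptions chosen for illustration.

```python
import numpy as np

def thermo_solve(A, b, beta=100.0, dt=0.01, n_steps=20000, burn_in=2000, seed=0):
    """Estimate A^{-1} b by time-averaging simulated Langevin dynamics.

    Assumed model (not taken from the paper):
        dx = -(A x - b) dt + sqrt(2 / beta) dW
    For symmetric positive definite A, the stationary distribution is
    Gaussian with mean A^{-1} b, so a long time average approaches it.
    """
    rng = np.random.default_rng(seed)
    x = np.zeros_like(b)
    total = np.zeros_like(b)
    noise_scale = np.sqrt(2.0 * dt / beta)
    for k in range(n_steps):
        # Euler-Maruyama step: drift toward A^{-1} b plus thermal noise.
        x = x - dt * (A @ x - b) + noise_scale * rng.normal(size=b.shape)
        if k >= burn_in:
            total += x  # accumulate samples after the transient has decayed
    return total / (n_steps - burn_in)

# Toy problem: a stand-in Fisher matrix F built from random per-sample vectors.
rng = np.random.default_rng(1)
d = 4
J = rng.normal(size=(50, d))   # hypothetical per-sample score vectors
F = J.T @ J / 50               # positive semi-definite curvature estimate
g = rng.normal(size=d)         # hypothetical loss gradient
lam = 0.5                      # assumed damping so that A is positive definite
A = F + lam * np.eye(d)

exact = np.linalg.solve(A, g)  # digital linear solve, O(d^3) for dense A
approx = thermo_solve(A, g)    # simulated equilibrium estimate

# One damped natural-gradient step using the estimated direction.
theta = np.zeros(d)
eta = 0.1                      # assumed learning rate
theta_new = theta - eta * approx

print("exact :", exact)
print("approx:", approx)
```

On a digital computer this simulation is slower than the direct solve; the point is only that the natural-gradient direction appears as an equilibrium average, which is the quantity an analog thermodynamic device would supply physically.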