Dyson Brownian motion and random matrix dynamics of weight matrices during learning
The work offers machine learning researchers theoretical insight into training dynamics, but the contribution is incremental: it applies existing random matrix theory concepts to known training processes.
The paper uses random matrix theory to analyse the stochastic dynamics of weight matrices during training. It shows that these dynamics can be described by Dyson Brownian motion and that the level of stochasticity depends on the ratio of the learning rate to the mini-batch size, which explains the linear scaling rule. The scaling is verified in a restricted Boltzmann machine, and weight matrix dynamics are then analysed in transformers.
During training, weight matrices in machine learning architectures are updated using stochastic gradient descent or variations thereof. In this contribution we employ concepts of random matrix theory to analyse the resulting stochastic matrix dynamics. We first demonstrate that the dynamics can generically be described using Dyson Brownian motion, leading, for example, to eigenvalue repulsion. The level of stochasticity is shown to depend on the ratio of the learning rate and the mini-batch size, explaining the empirically observed linear scaling rule. We verify this linear scaling in the restricted Boltzmann machine. Subsequently we study weight matrix dynamics in transformers (a nano-GPT), following the evolution from a Marchenko-Pastur distribution for eigenvalues at initialisation to a combination of this distribution with additional structure at the end of learning.
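The Marchenko-Pastur starting point can be illustrated with a minimal sketch. It assumes a weight matrix with i.i.d. Gaussian entries of unit variance and looks at the eigenvalues of W^T W / N. The matrix shape, the normalisation, and the function names `marchenko_pastur_edges` and `initial_spectrum` are illustrative assumptions and are not taken from the paper, whose initialisation scheme may differ.

```python
import numpy as np


def marchenko_pastur_edges(ratio, sigma2=1.0):
    """Support edges of the Marchenko-Pastur law for aspect ratio q = M/N <= 1."""
    lower = sigma2 * (1.0 - np.sqrt(ratio)) ** 2
    upper = sigma2 * (1.0 + np.sqrt(ratio)) ** 2
    return lower, upper


def initial_spectrum(n_rows=2000, n_cols=500, sigma=1.0, seed=0):
    """Eigenvalues of W^T W / N for an N x M matrix W with i.i.d. Gaussian entries.

    The shape and the unit-variance initialisation are assumptions made
    for illustration only.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, sigma, size=(n_rows, n_cols))
    gram = W.T @ W / n_rows          # M x M symmetric matrix
    return np.linalg.eigvalsh(gram)  # real eigenvalues in ascending order


eigs = initial_spectrum()
lo, hi = marchenko_pastur_edges(500 / 2000)

# For large matrices the empirical spectrum should sit close to [lo, hi].
print(f"empirical range: [{eigs.min():.3f}, {eigs.max():.3f}]")
print(f"Marchenko-Pastur edges: [{lo:.3f}, {hi:.3f}]")
```

Comparing a histogram of `eigs` at initialisation with the spectrum of the same matrix after training is one way to look for the additional structure that develops during learning.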