LG | Aug 16, 2024
A Mean Field Ansatz for Zero-Shot Weight Transfer
Xingyuan Chen, Wenwei Kuang, Lei Deng et al.
The pre-training cost of large language models (LLMs) is prohibitive. One cutting-edge approach to reducing this cost is zero-shot weight transfer, in some cases also known as model growth, which transfers the weights trained in a small model directly to a large model. However, the theory behind weight transfer is still poorly understood. In this paper, inspired by prior applications of mean field theory to neural network dynamics, we introduce a mean field ansatz to provide a theoretical explanation for weight transfer. Specifically, we propose the row-column (RC) ansatz under the mean field point of view, which describes the measure structure of the weights in a neural network (NN) and admits closed measure dynamics. Thus, under proper assumptions, the weights of NNs of different sizes admit a common distribution, and weight transfer methods can be viewed as sampling methods. We empirically validate the RC ansatz on simple MLP examples and on LLMs such as GPT-3 and Llama-3.1. We show that the mean-field point of view is adequate under suitable assumptions and can therefore provide theoretical support for zero-shot weight transfer.
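To make the "weight transfer as sampling" reading concrete, here is a minimal sketch. It is not the authors' method: the function name `grow_by_resampling`, the uniform resampling of row and column indices, and the absence of any rescaling are all assumptions made for illustration. The idea shown is only that a larger weight matrix can be built by drawing rows and columns from a smaller trained one, so that both matrices share the same pool of entries.

```python
import numpy as np

def grow_by_resampling(w_small, new_rows, new_cols, seed=0):
    """Hypothetical helper: build a larger matrix by sampling row and
    column indices of a small matrix uniformly, with replacement.

    Illustrative assumption only; the paper's actual transfer rule
    and any width-dependent rescaling are not reproduced here.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = w_small.shape
    # Each new row/column is a copy of a randomly chosen old one.
    row_idx = rng.integers(0, n_rows, size=new_rows)
    col_idx = rng.integers(0, n_cols, size=new_cols)
    # np.ix_ builds the outer index so that
    # w_large[i, j] == w_small[row_idx[i], col_idx[j]].
    w_large = w_small[np.ix_(row_idx, col_idx)]
    return w_large, row_idx, col_idx

if __name__ == "__main__":
    w = np.random.default_rng(1).normal(size=(4, 3))
    big, rows, cols = grow_by_resampling(w, new_rows=16, new_cols=12)
    print(big.shape)
```

In this sketch every entry of the large matrix is an entry of the small one, so the two empirical weight distributions are supported on the same values; whether they agree in the sense the abstract describes depends on assumptions not stated here.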
PR | Apr 29, 2019
An unbiased Ito type stochastic representation for transport PDEs: A Toy Example
Goncalo dos Reis, Greig Smith
We propose a stochastic representation for a simple class of transport PDEs based on Ito representations. We detail an algorithm using an estimator stemming from the representation that, unlike regularization by noise estimators, is unbiased. We rely on recent developments on branching diffusions, regime switching processes and their representations of PDEs. There is a loose relation between our technique and regularization by noise, but contrary to the latter, we add a perturbation and immediately apply its correction. The method is only possible through a judicious choice of the diffusion coefficient $\sigma$. A key feature is that our approach does not rely on the smallness of $\sigma$; in fact, our $\sigma$ is strictly bounded from below, which is in stark contrast with standard perturbation techniques. This is critical for extending the method to non-toy PDEs with terms that are nonlinear in the first derivative, where the usual perturbation technique breaks down. The examples presented show the algorithm outperforming alternative approaches. Moreover, the examples point toward a potential algorithm for the fully nonlinear case, where the method of characteristics breaks down.
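The bias that the abstract attributes to regularization by noise can be seen in a small sketch. This is the baseline being contrasted, not the authors' unbiased estimator. The PDE form $\partial_t u + b\,\partial_x u = 0$ with $u(0,x) = g(x)$, the constant speed `b`, and the function names are all assumptions chosen for illustration. Adding a viscosity term $\tfrac{\sigma^2}{2}\partial_{xx} u$ gives the representation $u_\sigma(t,x) = \mathbb{E}[g(x - bt + \sigma W_t)]$; for $g(y) = y^2$ this differs from the exact solution $g(x - bt)$ by $\sigma^2 t$.

```python
import numpy as np

def exact_transport(g, b, t, x):
    """Exact solution of u_t + b u_x = 0, u(0, x) = g(x),
    obtained by the method of characteristics (assumed toy PDE)."""
    return g(x - b * t)

def noise_regularized_estimate(g, b, sigma, t, x, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E[g(x - b t + sigma W_t)], the solution
    of the viscous PDE u_t + b u_x = (sigma^2 / 2) u_xx.

    This is the biased regularization-by-noise baseline, not the
    unbiased estimator proposed in the paper.
    """
    rng = np.random.default_rng(seed)
    # W_t is Gaussian with mean 0 and variance t.
    w_t = rng.normal(0.0, np.sqrt(t), size=n_samples)
    return g(x - b * t + sigma * w_t).mean()

if __name__ == "__main__":
    g = lambda y: y ** 2
    b, t, x = 1.0, 0.5, 2.0
    print("exact:", exact_transport(g, b, t, x))
    for sigma in (1.0, 0.5, 0.1):
        est = noise_regularized_estimate(g, b, sigma, t, x)
        print("sigma =", sigma, "estimate:", est)
```

In this baseline the bias only vanishes as $\sigma \to 0$, which is the dependence on small $\sigma$ that the abstract says the proposed method avoids.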