Michael Winer

DIS-NN
3 papers
6 citations
Novelty 45%
AI Score 40

3 Papers

87.0 · ML · May 22
Asymmetric Scaling Laws from Sparse Features

John Sous, Michael Winer

We introduce a model for neural scaling laws under sparse activations. In the model, test loss is often dominated by rare coordinates that are never observed in the training input. This mechanism induces a novel bottleneck absent from dense models. We derive the asymptotic population loss in both the underparameterized and overparameterized regimes, and show that the loss exhibits a double-descent peak near the interpolation threshold, where the number of parameters is just sufficient to fit the training data. The resulting loss curve is governed by two distinct scaling exponents, one for the overparameterized regime and one for the underparameterized regime, with a gap determined by the degree of sparsity. Additionally, we derive a compute-optimal frontier that favors increasing dataset size over model capacity under fixed compute budgets. We also analyze gradient-descent dynamics and identify a scaling law for the probability that fixed-step gradient descent becomes unstable. We further show that the sparsity-induced effect persists under nonlinear activations.

DIS-NN · Aug 12, 2024
Neural Networks as Spin Models: From Glass to Hidden Order Through Training

Richard Barney, Michael Winer, Victor Galitski

We explore a one-to-one correspondence between a neural network (NN) and a statistical mechanical spin model where neurons are mapped to Ising spins and weights to spin-spin couplings. The process of training an NN produces a family of spin Hamiltonians parameterized by training time. We study the magnetic phases and the melting transition temperature as training progresses. First, we prove analytically that the common initial state before training--an NN with independent random weights--maps to a layered version of the classical Sherrington-Kirkpatrick spin glass exhibiting replica symmetry breaking. We calculate the spin-glass-to-paramagnet transition temperature. Further, we use the Thouless-Anderson-Palmer (TAP) equations--a theoretical technique to analyze the landscape of energy minima of random systems--to determine the evolution of the magnetic phases in two types of NNs (one with continuous and one with binarized activations) trained on the MNIST dataset. The two NN types give rise to similar results, showing a quick destruction of the spin glass and the appearance of a phase with a hidden order, whose melting transition temperature $T_c$ grows as a power law in training time. We also discuss the properties of the spectrum of the spin system's bond matrix in the context of rich vs. lazy learning. We suggest that this statistical mechanical view of NNs provides a useful unifying perspective on the training process, which can be viewed as selecting and strengthening a symmetry-broken state associated with the training task.
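
To make the neuron-to-spin mapping concrete, here is a minimal sketch that evaluates an Ising-style energy for a layered network in which each weight matrix couples spins in adjacent layers. The function name `layered_energy`, the sign convention, the absence of bias or field terms, and the layer sizes are all assumptions made for illustration; the Hamiltonian used in the paper may differ.

```python
import numpy as np

def layered_energy(weights, spins):
    """Energy of a layered Ising configuration (hypothetical convention).

    weights[l] has shape (n_{l+1}, n_l) and couples layer l to layer l+1.
    spins[l] is a vector of +1/-1 values of length n_l.
    Assumed form: E = -sum_l  s_{l+1}^T W_l s_l  (no field terms).
    """
    energy = 0.0
    for W, s_lower, s_upper in zip(weights, spins[:-1], spins[1:]):
        energy -= float(s_upper @ W @ s_lower)
    return energy

# Untrained network: independent Gaussian weights, as in the abstract's
# "common initial state before training".
rng = np.random.default_rng(0)
sizes = [8, 6, 4]  # illustrative layer widths
weights = [
    rng.standard_normal((n_out, n_in)) / np.sqrt(n_in)
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]
spins = [rng.choice([-1, 1], size=n) for n in sizes]

energy = layered_energy(weights, spins)
# Flipping every spin leaves a purely bilinear energy unchanged.
flipped_energy = layered_energy(weights, [-s for s in spins])
```

Under this assumed form, the energy is invariant under a global spin flip, which is the symmetry that an ordered phase would break.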

50.2 · LG · May 6
Estimating the expected output of wide random MLPs more efficiently than sampling

Wilson Wu, Victor Lecomte, Michael Winer et al.

By far the most common way to estimate an expected loss in machine learning is to draw samples, compute the loss on each one, and take the empirical average. However, sampling is not necessarily optimal. Given an MLP at initialization, we show how to estimate its expected output over Gaussian inputs without running samples through the network at all. Instead, we produce approximate representations of the distributions of activations at each layer, leveraging tools such as cumulants and Hermite expansions. We show both theoretically and empirically that for sufficiently wide networks, our estimator achieves a target mean squared error using substantially fewer FLOPs than Monte Carlo sampling. We find moreover that our methods perform particularly well at estimating the probabilities of rare events, and additionally demonstrate how they can be used for model training. Together, these findings suggest a path to producing models with a greatly reduced probability of catastrophic tail risks.
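
As a toy illustration of the contrast between sampling and sampling-free estimation, the sketch below handles only the simplest case: a one-hidden-layer ReLU network with standard Gaussian inputs, where the expected output has a closed form because each pre-activation is a centered Gaussian and E[ReLU(z)] = σ/√(2π) for z ~ N(0, σ²). This is not the paper's estimator, which targets deeper networks with cumulants and Hermite expansions; the function names, sizes, and sample count are assumptions chosen for illustration.

```python
import numpy as np

def mc_estimate(W1, w2, n_samples, rng):
    """Monte Carlo baseline: average the network output over sampled inputs."""
    x = rng.standard_normal((n_samples, W1.shape[1]))
    hidden = np.maximum(x @ W1.T, 0.0)  # ReLU activations
    return float((hidden @ w2).mean())

def analytic_estimate(W1, w2):
    """Closed-form expectation for one hidden ReLU layer, x ~ N(0, I).

    Pre-activation j is N(0, ||W1[j]||^2), so E[ReLU] = ||W1[j]|| / sqrt(2*pi).
    Linearity of expectation then gives the expected output directly.
    """
    sigma = np.linalg.norm(W1, axis=1)
    return float(w2 @ sigma / np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
d, width = 16, 64  # illustrative input dimension and hidden width
W1 = rng.standard_normal((width, d)) / np.sqrt(d)
w2 = rng.standard_normal(width) / np.sqrt(width)

exact = analytic_estimate(W1, w2)
approx = mc_estimate(W1, w2, 50_000, rng)
print(f"analytic: {exact:.4f}  monte carlo: {approx:.4f}")
```

The analytic value costs one pass over the weights, while the Monte Carlo error shrinks only as the inverse square root of the sample count. For deeper networks the activation distributions are no longer Gaussian, which is where approximate layer-wise representations become necessary.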