LGMar 7, 2022Code
Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter TransferGreg Yang, Edward J. Hu, Igor Babuschkin et al. · microsoft-research
Hyperparameter (HP) tuning in deep learning is an expensive process, prohibitively so for neural networks (NNs) with billions of parameters. We show that, in the recently discovered Maximal Update Parametrization (muP), many optimal HPs remain stable even as model size changes. This leads to a new HP tuning paradigm we call muTransfer: parametrize the target model in muP, tune the HP indirectly on a smaller model, and zero-shot transfer them to the full-sized model, i.e., without directly tuning the latter at all. We verify muTransfer on Transformer and ResNet. For example, 1) by transferring pretraining HPs from a model of 13M parameters, we outperform published numbers of BERT-large (350M parameters), with a total tuning cost equivalent to pretraining BERT-large once; 2) by transferring from 40M parameters, we outperform published numbers of the 6.7B GPT-3 model, with tuning cost only 7% of total pretraining cost. A Pytorch implementation of our technique can be found at github.com/microsoft/mup and installable via `pip install mup`.
LGOct 26, 2023
A Spectral Condition for Feature LearningGreg Yang, James B. Simon, Jeremy Bernstein · mit
The push to train ever larger neural networks has motivated the study of initialization and training at large network width. A key challenge is to scale training so that a network's internal representations evolve nontrivially at all widths, a process known as feature learning. Here, we show that feature learning is achieved by scaling the spectral norm of weight matrices and their updates like $\sqrt{\texttt{fan-out}/\texttt{fan-in}}$, in contrast to widely used but heuristic scalings based on Frobenius norm and entry size. Our spectral scaling analysis also leads to an elementary derivation of \emph{maximal update parametrization}. All in all, we aim to provide the reader with a solid conceptual understanding of feature learning in neural networks.
LGAug 3, 2023
Tensor Programs IVb: Adaptive Optimization in the Infinite-Width LimitGreg Yang, Etai Littwin · microsoft-research
Going beyond stochastic gradient descent (SGD), what new phenomena emerge in wide neural networks trained by adaptive optimizers like Adam? Here we show: The same dichotomy between feature learning and kernel behaviors (as in SGD) holds for general optimizers as well, including Adam -- albeit with a nonlinear notion of "kernel." We derive the corresponding "neural tangent" and "maximal update" limits for any architecture. Two foundational advances underlie the above results: 1) A new Tensor Program language, NEXORT, that can express how adaptive optimizers process gradients into updates. 2) The introduction of bra-ket notation to drastically simplify expressions and calculations in Tensor Programs. This work summarizes and generalizes all previous results in the Tensor Programs series of papers.
MLFeb 1, 2023
Width and Depth Limits Commute in Residual NetworksSoufiane Hayou, Greg Yang · microsoft-research
We show that taking the width and depth to infinity in a deep neural network with skip connections, when branches are scaled by $1/\sqrt{depth}$ (the only nontrivial scaling), result in the same covariance structure no matter how that limit is taken. This explains why the standard infinite-width-then-depth approach provides practical insights even for networks with depth of the same order as width. We also demonstrate that the pre-activations, in this case, have Gaussian distributions which has direct applications in Bayesian deep learning. We conduct extensive simulations that show an excellent match with our theoretical findings.
MLMay 3, 2022
High-dimensional Asymptotics of Feature Learning: How One Gradient Step Improves the RepresentationJimmy Ba, Murat A. Erdogdu, Taiji Suzuki et al.
We study the first gradient descent step on the first-layer parameters $\boldsymbol{W}$ in a two-layer neural network: $f(\boldsymbol{x}) = \frac{1}{\sqrt{N}}\boldsymbol{a}^\topσ(\boldsymbol{W}^\top\boldsymbol{x})$, where $\boldsymbol{W}\in\mathbb{R}^{d\times N}, \boldsymbol{a}\in\mathbb{R}^{N}$ are randomly initialized, and the training objective is the empirical MSE loss: $\frac{1}{n}\sum_{i=1}^n (f(\boldsymbol{x}_i)-y_i)^2$. In the proportional asymptotic limit where $n,d,N\to\infty$ at the same rate, and an idealized student-teacher setting, we show that the first gradient update contains a rank-1 "spike", which results in an alignment between the first-layer weights and the linear component of the teacher model $f^*$. To characterize the impact of this alignment, we compute the prediction risk of ridge regression on the conjugate kernel after one gradient step on $\boldsymbol{W}$ with learning rate $η$, when $f^*$ is a single-index model. We consider two scalings of the first step learning rate $η$. For small $η$, we establish a Gaussian equivalence property for the trained feature map, and prove that the learned kernel improves upon the initial random features model, but cannot defeat the best linear model on the input. Whereas for sufficiently large $η$, we prove that for certain $f^*$, the same ridge estimator on trained features can go beyond this "linear regime" and outperform a wide range of random features and rotationally invariant kernels. Our results demonstrate that even one gradient step can lead to a considerable advantage over random features, and highlight the role of learning rate scaling in the initial phase of training.
CLNov 4, 2021Code
CLUES: Few-Shot Learning Evaluation in Natural Language UnderstandingSubhabrata Mukherjee, Xiaodong Liu, Guoqing Zheng et al.
Most recent progress in natural language understanding (NLU) has been driven, in part, by benchmarks such as GLUE, SuperGLUE, SQuAD, etc. In fact, many NLU models have now matched or exceeded "human-level" performance on many tasks in these benchmarks. Most of these benchmarks, however, give models access to relatively large amounts of labeled data for training. As such, the models are provided far more data than required by humans to achieve strong performance. That has motivated a line of work that focuses on improving few-shot learning performance of NLU models. However, there is a lack of standardized evaluation benchmarks for few-shot NLU resulting in different experimental settings in different papers. To help accelerate this line of work, we introduce CLUES (Constrained Language Understanding Evaluation Standard), a benchmark for evaluating the few-shot learning capabilities of NLU models. We demonstrate that while recent models reach human performance when they have access to large amounts of labeled data, there is a huge gap in performance in the few-shot setting for most tasks. We also demonstrate differences between alternative model families and adaptation techniques in the few shot setting. Finally, we discuss several principles and choices in designing the experimental settings for evaluating the true few-shot learning performance and suggest a unified standardized approach to few-shot learning evaluation. We aim to encourage research on NLU models that can generalize to new tasks with a small number of examples. Code and data for CLUES are available at https://github.com/microsoft/CLUES.
CVJun 7, 2021Code
3DB: A Framework for Debugging Computer Vision ModelsGuillaume Leclerc, Hadi Salman, Andrew Ilyas et al.
We introduce 3DB: an extendable, unified framework for testing and debugging vision models using photorealistic simulation. We demonstrate, through a wide range of use cases, that 3DB allows users to discover vulnerabilities in computer vision systems and gain insights into how models make decisions. 3DB captures and generalizes many robustness analyses from prior work, and enables one to study their interplay. Finally, we find that the insights generated by the system transfer to the physical world. We are releasing 3DB as a library (https://github.com/3db/3db) alongside a set of example analyses, guides, and documentation: https://3db.github.io/3db/ .
LGNov 30, 2020Code
Feature Learning in Infinite-Width Neural NetworksGreg Yang, Edward J. Hu
As its width tends to infinity, a deep neural network's behavior under gradient descent can become simplified and predictable (e.g. given by the Neural Tangent Kernel (NTK)), if it is parametrized appropriately (e.g. the NTK parametrization). However, we show that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can learn features, which is crucial for pretraining and transfer learning such as with BERT. We propose simple modifications to the standard parametrization to allow for feature learning in the limit. Using the *Tensor Programs* technique, we derive explicit formulas for such limits. On Word2Vec and few-shot learning on Omniglot via MAML, two canonical tasks that rely crucially on feature learning, we compute these limits exactly. We find that they outperform both NTK baselines and finite-width networks, with the latter approaching the infinite-width feature learning performance as width increases. More generally, we classify a natural space of neural network parametrizations that generalizes standard, NTK, and Mean Field parametrizations. We show 1) any parametrization in this space either admits feature learning or has an infinite-width training dynamics given by kernel gradient descent, but not both; 2) any such infinite-width limit can be computed using the Tensor Programs technique. Code for our experiments can be found at github.com/edwardjhu/TP4.
MLJun 25, 2020Code
Tensor Programs II: Neural Tangent Kernel for Any ArchitectureGreg Yang
We prove that a randomly initialized neural network of *any architecture* has its Tangent Kernel (NTK) converge to a deterministic limit, as the network widths tend to infinity. We demonstrate how to calculate this limit. In prior literature, the heuristic study of neural network gradients often assumes every weight matrix used in forward propagation is independent from its transpose used in backpropagation (Schoenholz et al. 2017). This is known as the *gradient independence assumption (GIA)*. We identify a commonly satisfied condition, which we call *Simple GIA Check*, such that the NTK limit calculation based on GIA is correct. Conversely, when Simple GIA Check fails, we show GIA can result in wrong answers. Our material here presents the NTK results of Yang (2019a) in a friendly manner and showcases the *tensor programs* technique for understanding wide neural networks. We provide reference implementations of infinite-width NTKs of recurrent neural network, transformer, and batch normalization at https://github.com/thegregyang/NTK4A.
LGApr 26, 2020Code
Improved Image Wasserstein Attacks and DefensesEdward J. Hu, Adith Swaminathan, Hadi Salman et al.
Robustness against image perturbations bounded by a $\ell_p$ ball have been well-studied in recent literature. Perturbations in the real-world, however, rarely exhibit the pixel independence that $\ell_p$ threat models assume. A recently proposed Wasserstein distance-bounded threat model is a promising alternative that limits the perturbation to pixel mass movements. We point out and rectify flaws in previous definition of the Wasserstein threat model and explore stronger attacks and defenses under our better-defined framework. Lastly, we discuss the inability of current Wasserstein-robust models in defending against perturbations seen in the real world. Our code and trained models are available at https://github.com/edwardjhu/improved_wasserstein .
LGMar 4, 2020Code
Denoised Smoothing: A Provable Defense for Pretrained ClassifiersHadi Salman, Mingjie Sun, Greg Yang et al.
We present a method for provably defending any pretrained image classifier against $\ell_p$ adversarial attacks. This method, for instance, allows public vision API providers and users to seamlessly convert pretrained non-robust classification services into provably robust ones. By prepending a custom-trained denoiser to any off-the-shelf image classifier and using randomized smoothing, we effectively create a new classifier that is guaranteed to be $\ell_p$-robust to adversarial examples, without modifying the pretrained classifier. Our approach applies to both the white-box and the black-box settings of the pretrained classifier. We refer to this defense as denoised smoothing, and we demonstrate its effectiveness through extensive experimentation on ImageNet and CIFAR-10. Finally, we use our approach to provably defend the Azure, Google, AWS, and ClarifAI image classification APIs. Our code replicating all the experiments in the paper can be found at: https://github.com/microsoft/denoised-smoothing.
LGFeb 19, 2020Code
Randomized Smoothing of All Shapes and SizesGreg Yang, Tony Duan, J. Edward Hu et al.
Randomized smoothing is the current state-of-the-art defense with provable robustness against $\ell_2$ adversarial attacks. Many works have devised new randomized smoothing schemes for other metrics, such as $\ell_1$ or $\ell_\infty$; however, substantial effort was needed to derive such new guarantees. This begs the question: can we find a general theory for randomized smoothing? We propose a novel framework for devising and analyzing randomized smoothing schemes, and validate its effectiveness in practice. Our theoretical contributions are: (1) we show that for an appropriate notion of "optimal", the optimal smoothing distributions for any "nice" norms have level sets given by the norm's *Wulff Crystal*; (2) we propose two novel and complementary methods for deriving provably robust radii for any smoothing distribution; and, (3) we show fundamental limits to current randomized smoothing techniques via the theory of *Banach space cotypes*. By combining (1) and (2), we significantly improve the state-of-the-art certified accuracy in $\ell_1$ on standard datasets. Meanwhile, we show using (3) that with only label statistics under random input perturbations, randomized smoothing cannot achieve nontrivial certified accuracy against perturbations of $\ell_p$-norm $Ω(\min(1, d^{\frac{1}{p} - \frac{1}{2}}))$, when the input dimension $d$ is large. We provide code in github.com/tonyduan/rs4a.
NEOct 28, 2019Code
Tensor Programs I: Wide Feedforward or Recurrent Neural Networks of Any Architecture are Gaussian ProcessesGreg Yang
Wide neural networks with random weights and biases are Gaussian processes, as originally observed by Neal (1995) and more recently by Lee et al. (2018) and Matthews et al. (2018) for deep fully-connected networks, as well as by Novak et al. (2019) and Garriga-Alonso et al. (2019) for deep convolutional networks. We show that this Neural Network-Gaussian Process correspondence surprisingly extends to all modern feedforward or recurrent neural networks composed of multilayer perceptron, RNNs (e.g. LSTMs, GRUs), (nD or graph) convolution, pooling, skip connection, attention, batch normalization, and/or layer normalization. More generally, we introduce a language for expressing neural network computations, and our result encompasses all such expressible neural networks. This work serves as a tutorial on the *tensor programs* technique formulated in Yang (2019) and elucidates the Gaussian Process results obtained there. We provide open-source implementations of the Gaussian Process kernels of simple RNN, GRU, transformer, and batchnorm+ReLU network at github.com/thegregyang/GP4A.
LGJul 24, 2019Code
A Fine-Grained Spectral Perspective on Neural NetworksGreg Yang, Hadi Salman
Are neural networks biased toward simple functions? Does depth always help learn more complex features? Is training the last layer of a network as good as training all layers? How to set the range for learning rate tuning? These questions seem unrelated at face value, but in this work we give all of them a common treatment from the spectral perspective. We will study the spectra of the *Conjugate Kernel, CK,* (also called the *Neural Network-Gaussian Process Kernel*), and the *Neural Tangent Kernel, NTK*. Roughly, the CK and the NTK tell us respectively "what a network looks like at initialization" and "what a network looks like during and after training." Their spectra then encode valuable information about the initial distribution and the training and generalization properties of neural networks. By analyzing the eigenvalues, we lend novel insights into the questions put forth at the beginning, and we verify these insights by extensive experiments of neural networks. We derive fast algorithms for computing the spectra of CK and NTK when the data is uniformly distributed over the boolean cube, and show this spectra is the same in high dimensions when data is drawn from isotropic Gaussian or uniformly over the sphere. Code replicating our results is available at github.com/thegregyang/NNspectra.
LGJun 9, 2019Code
Provably Robust Deep Learning via Adversarially Trained Smoothed ClassifiersHadi Salman, Greg Yang, Jerry Li et al.
Recent works have shown the effectiveness of randomized smoothing as a scalable technique for building neural network-based classifiers that are provably robust to $\ell_2$-norm adversarial perturbations. In this paper, we employ adversarial training to improve the performance of randomized smoothing. We design an adapted attack for smoothed classifiers, and we show how this attack can be used in an adversarial training setting to boost the provable robustness of smoothed classifiers. We demonstrate through extensive experimentation that our method consistently outperforms all existing provably $\ell_2$-robust classifiers by a significant margin on ImageNet and CIFAR-10, establishing the state-of-the-art for provable $\ell_2$-defenses. Moreover, we find that pre-training and semi-supervised learning boost adversarially trained smoothed classifiers even further. Our code and trained models are available at http://github.com/Hadisalman/smoothing-adversarial .
LGFeb 23, 2019Code
A Convex Relaxation Barrier to Tight Robustness Verification of Neural NetworksHadi Salman, Greg Yang, Huan Zhang et al.
Verification of neural networks enables us to gauge their robustness against adversarial attacks. Verification algorithms fall into two categories: exact verifiers that run in exponential time and relaxed verifiers that are efficient but incomplete. In this paper, we unify all existing LP-relaxed verifiers, to the best of our knowledge, under a general convex relaxation framework. This framework works for neural networks with diverse architectures and nonlinearities and covers both primal and dual views of robustness verification. We further prove strong duality between the primal and dual problems under very mild conditions. Next, we perform large-scale experiments, amounting to more than 22 CPU-years, to obtain exact solution to the convex-relaxed problem that is optimal within our framework for ReLU networks. We find the exact solution does not significantly improve upon the gap between PGD and existing relaxed verifiers for various networks trained normally or robustly on MNIST and CIFAR datasets. Our results suggest there is an inherent barrier to tight verification for the large class of methods captured by our framework. We discuss possible causes of this barrier and potential future directions for bypassing it. Our code and trained models are available at http://github.com/Hadisalman/robust-verify-benchmark .
LGMar 12, 2025
Global Convergence and Rich Feature Learning in $L$-Layer Infinite-Width Neural Networks under $μ$P ParametrizationZixiang Chen, Greg Yang, Qingyue Zhao et al.
Despite deep neural networks' powerful representation learning capabilities, theoretical understanding of how networks can simultaneously achieve meaningful feature learning and global convergence remains elusive. Existing approaches like the neural tangent kernel (NTK) are limited because features stay close to their initialization in this parametrization, leaving open questions about feature properties during substantial evolution. In this paper, we investigate the training dynamics of infinitely wide, $L$-layer neural networks using the tensor program (TP) framework. Specifically, we show that, when trained with stochastic gradient descent (SGD) under the Maximal Update parametrization ($μ$P) and mild conditions on the activation function, SGD enables these networks to learn linearly independent features that substantially deviate from their initial values. This rich feature space captures relevant data information and ensures that any convergent point of the training process is a global minimum. Our analysis leverages both the interactions among features across layers and the properties of Gaussian random variables, providing new insights into deep representation learning. We further validate our theoretical findings through experiments on real-world datasets.
LGJul 1, 2021
Implicit Acceleration and Feature Learning in Infinitely Wide Neural Networks with BottlenecksEtai Littwin, Omid Saremi, Shuangfei Zhai et al.
We analyze the learning dynamics of infinitely wide neural networks with a finite sized bottle-neck. Unlike the neural tangent kernel limit, a bottleneck in an otherwise infinite width network al-lows data dependent feature learning in its bottle-neck representation. We empirically show that a single bottleneck in infinite networks dramatically accelerates training when compared to purely in-finite networks, with an improved overall performance. We discuss the acceleration phenomena by drawing similarities to infinitely wide deep linear models, where the acceleration effect of a bottleneck can be understood theoretically.
LGMay 8, 2021
Tensor Programs IIb: Architectural Universality of Neural Tangent Kernel Training DynamicsGreg Yang, Etai Littwin
Yang (2020a) recently showed that the Neural Tangent Kernel (NTK) at initialization has an infinite-width limit for a large class of architectures including modern staples such as ResNet and Transformers. However, their analysis does not apply to training. Here, we show the same neural networks (in the so-called NTK parametrization) during training follow a kernel gradient descent dynamics in function space, where the kernel is the infinite-width NTK. This completes the proof of the *architectural universality* of NTK behavior. To achieve this result, we apply the Tensor Programs technique: Write the entire SGD dynamics inside a Tensor Program and analyze it via the Master Theorem. To facilitate this proof, we develop a graphical notation for Tensor Programs.
NESep 22, 2020
Tensor Programs III: Neural Matrix LawsGreg Yang
In a neural network (NN), *weight matrices* linearly transform inputs into *preactivations* that are then transformed nonlinearly into *activations*. A typical NN interleaves multitudes of such linear and nonlinear transforms to express complex functions. Thus, the (pre-)activations depend on the weights in an intricate manner. We show that, surprisingly, (pre-)activations of a randomly initialized NN become *independent* from the weights as the NN's widths tend to infinity, in the sense of asymptotic freeness in random matrix theory. We call this the Free Independence Principle (FIP), which has these consequences: 1) It rigorously justifies the calculation of asymptotic Jacobian singular value distribution of an NN in Pennington et al. [36,37], essential for training ultra-deep NNs [48]. 2) It gives a new justification of gradient independence assumption used for calculating the Neural Tangent Kernel of a neural network. FIP and these results hold for any neural architecture. We show FIP by proving a Master Theorem for any Tensor Program, as introduced in Yang [50,51], generalizing the Master Theorems proved in those works. As warmup demonstrations of this new Master Theorem, we give new proofs of the semicircle and Marchenko-Pastur laws, which benchmarks our framework against these fundamental mathematical results.
LGMar 27, 2020
On Infinite-Width HypernetworksEtai Littwin, Tomer Galanti, Lior Wolf et al.
{\em Hypernetworks} are architectures that produce the weights of a task-specific {\em primary network}. A notable application of hypernetworks in the recent literature involves learning to output functional representations. In these scenarios, the hypernetwork learns a representation corresponding to the weights of a shallow MLP, which typically encodes shape or image information. While such representations have seen considerable success in practice, they remain lacking in the theoretical guarantees in the wide regime of the standard architectures. In this work, we study wide over-parameterized hypernetworks. We show that unlike typical architectures, infinitely wide hypernetworks do not guarantee convergence to a global minima under gradient descent. We further show that convexity can be achieved by increasing the dimensionality of the hypernetwork's output, to represent wide MLPs. In the dually infinite-width regime, we identify the functional priors of these architectures by deriving their corresponding GP and NTK kernels, the latter of which we refer to as the {\em hyperkernel}. As part of this study, we make a mathematical contribution by deriving tight bounds on high order Taylor expansion terms of standard fully connected ReLU networks.
ACSep 5, 2019
Free resolutions of function classes via order complexesJustin Chen, Christopher Eur, Greg Yang et al.
Function classes are collections of Boolean functions on a finite set, which are fundamental objects of study in theoretical computer science. We study algebraic properties of ideals associated to function classes previously defined by the third author. We consider the broad family of intersection-closed function classes, and describe cellular free resolutions of their ideals by order complexes of the associated posets. For function classes arising from matroids, polyhedral cell complexes, and more generally interval Cohen-Macaulay posets, we show that the multigraded Betti numbers are pure, and are given combinatorially by the Möbius functions. We then apply our methods to derive bounds on the VC dimension of some important families of function classes in learning theory.
NEFeb 21, 2019
A Mean Field Theory of Batch NormalizationGreg Yang, Jeffrey Pennington, Vinay Rao et al.
We develop a mean field theory for batch normalization in fully-connected feedforward neural networks. In so doing, we provide a precise characterization of signal propagation and gradient backpropagation in wide batch-normalized networks at initialization. Our theory shows that gradient signals grow exponentially in depth and that these exploding gradients cannot be eliminated by tuning the initial weight variances or by adjusting the nonlinear activation function. Indeed, batch normalization itself is the cause of gradient explosion. As a result, vanilla batch-normalized networks without skip connections are not trainable at large depths for common initialization schemes, a prediction that we verify with a variety of empirical simulations. While gradient explosion cannot be eliminated, it can be reduced by tuning the network close to the linear regime, which improves the trainability of deep batch-normalized networks without residual connections. Finally, we investigate the learning dynamics of batch-normalized networks and observe that after a single step of optimization the networks achieve a relatively stable equilibrium in which gradients have dramatically smaller dynamic range. Our theory leverages Laplace, Fourier, and Gegenbauer transforms and we derive new identities that may be of independent interest.
NEFeb 13, 2019
Scaling Limits of Wide Neural Networks with Weight Sharing: Gaussian Process Behavior, Gradient Independence, and Neural Tangent Kernel DerivationGreg Yang
Several recent trends in machine learning theory and practice, from the design of state-of-the-art Gaussian Process to the convergence analysis of deep neural nets (DNNs) under stochastic gradient descent (SGD), have found it fruitful to study wide random neural networks. Central to these approaches are certain scaling limits of such networks. We unify these results by introducing a notion of a straightline \emph{tensor program} that can express most neural network computations, and we characterize its scaling limit when its tensors are large and randomized. From our framework follows (1) the convergence of random neural networks to Gaussian processes for architectures such as recurrent neural networks, convolutional neural networks, residual networks, attention, and any combination thereof, with or without batch normalization; (2) conditions under which the \emph{gradient independence assumption} -- that weights in backpropagation can be assumed to be independent from weights in the forward pass -- leads to correct computation of gradient dynamics, and corrections when it does not; (3) the convergence of the Neural Tangent Kernel, a recently proposed kernel used to predict training dynamics of neural networks under gradient descent, at initialization for all architectures in (1) without batch normalization. Mathematically, our framework is general enough to rederive classical random matrix results such as the semicircle and the Marchenko-Pastur laws, as well as recent results in neural network Jacobian singular values. We hope our work opens a way toward design of even stronger Gaussian Processes, initialization schemes to avoid gradient explosion/vanishing, and deeper understanding of SGD dynamics in modern architectures.
AIFeb 12, 2019
NAIL: A General Interactive Fiction AgentMatthew Hausknecht, Ricky Loynd, Greg Yang et al.
Interactive Fiction (IF) games are complex textual decision making problems. This paper introduces NAIL, an autonomous agent for general parser-based IF games. NAIL won the 2018 Text Adventure AI Competition, where it was evaluated on twenty unseen games. This paper describes the architecture, development, and insights underpinning NAIL's performance.
LGJan 25, 2019
Dynamical Isometry and a Mean Field Theory of LSTMs and GRUsDar Gilboa, Bo Chang, Minmin Chen et al.
Training recurrent neural networks (RNNs) on long sequence tasks is plagued with difficulties arising from the exponential explosion or vanishing of signals as they propagate forward or backward through the network. Many techniques have been proposed to ameliorate these issues, including various algorithmic and architectural modifications. Two of the most successful RNN architectures, the LSTM and the GRU, do exhibit modest improvements over vanilla RNN cells, but they still suffer from instabilities when trained on very long sequences. In this work, we develop a mean field theory of signal propagation in LSTMs and GRUs that enables us to calculate the time scales for signal propagation as well as the spectral properties of the state-to-state Jacobians. By optimizing these quantities in terms of the initialization hyperparameters, we derive a novel initialization scheme that eliminates or reduces training instabilities. We demonstrate the efficacy of our initialization scheme on multiple sequence tasks, on which it enables successful training while a standard initialization either fails completely or is orders of magnitude slower. We also observe a beneficial effect on generalization performance using this new initialization.
MLOct 11, 2018
Bayesian Deep Convolutional Networks with Many Channels are Gaussian ProcessesRoman Novak, Lechao Xiao, Jaehoon Lee et al.
There is a previously identified equivalence between wide fully connected neural networks (FCNs) and Gaussian processes (GPs). This equivalence enables, for instance, test set predictions that would have resulted from a fully Bayesian, infinitely wide trained FCN to be computed without ever instantiating the FCN, but by instead evaluating the corresponding GP. In this work, we derive an analogous equivalence for multi-layer convolutional neural networks (CNNs) both with and without pooling layers, and achieve state of the art results on CIFAR10 for GPs without trainable kernels. We also introduce a Monte Carlo method to estimate the GP corresponding to a given neural network architecture, even in cases where the analytic form has too many terms to be computationally feasible. Surprisingly, in the absence of pooling layers, the GPs corresponding to CNNs with and without weight sharing are identical. As a consequence, translation equivariance, beneficial in finite channel CNNs trained with stochastic gradient descent (SGD), is guaranteed to play no role in the Bayesian treatment of the infinite channel limit - a qualitative difference between the two regimes that is not present in the FCN case. We confirm experimentally, that while in some scenarios the performance of SGD-trained finite CNNs approaches that of the corresponding GPs as the channel count increases, with careful tuning SGD-trained CNNs can significantly outperform their corresponding GPs, suggesting advantages from SGD training compared to fully Bayesian parameter estimation.
NEDec 24, 2017
Mean Field Residual Networks: On the Edge of ChaosGreg Yang, Samuel S. Schoenholz
We study randomly initialized residual networks using mean field theory and the theory of difference equations. Classical feedforward neural networks, such as those with tanh activations, exhibit exponential behavior on the average when propagating inputs forward or gradients backward. The exponential forward dynamics causes rapid collapsing of the input space geometry, while the exponential backward dynamics causes drastic vanishing or exploding gradients. We show, in contrast, that by adding skip connections, the network will, depending on the nonlinearity, adopt subexponential forward and backward dynamics, and in many cases in fact polynomial. The exponents of these polynomials are obtained through analytic methods and proved and verified empirically to be correct. In terms of the "edge of chaos" hypothesis, these subexponential and polynomial laws allow residual networks to "hover over the boundary between stability and chaos," thus preserving the geometry of the input space and the gradient information flow. In our experiments, for each activation function we study here, we initialize residual networks with different hyperparameters and train them on MNIST. Remarkably, our initialization time theory can accurately predict test time performance of these networks, by tracking either the expected amount of gradient explosion or the expected squared distance between the images of two input vectors. Importantly, we show, theoretically as well as empirically, that common initializations such as the Xavier or the He schemes are not optimal for residual networks, because the optimal initialization variances depend on the depth. Finally, we have made mathematical contributions by deriving several new identities for the kernels of powers of ReLU functions by relating them to the zeroth Bessel function of the second kind.
ACJan 9, 2017
A Homological Theory of FunctionsGreg Yang
In computational complexity, a complexity class is given by a set of problems or functions, and a basic challenge is to show separations of complexity classes $A \not= B$ especially when $A$ is known to be a subset of $B$. In this paper we introduce a homological theory of functions that can be used to establish complexity separations, while also providing other interesting consequences. We propose to associate a topological space $S_A$ to each class of functions $A$, such that, to separate complexity classes $A \subseteq B'$, it suffices to observe a change in "the number of holes", i.e. homology, in $S_A$ as a subclass $B$ of $B'$ is added to $A$. In other words, if the homologies of $S_A$ and $S_{A \cup B}$ are different, then $A \not= B'$. We develop the underlying theory of functions based on combinatorial and homological commutative algebra and Stanley-Reisner theory, and recover Minsky and Papert's 1969 result that parity cannot be computed by nonmaximal degree polynomial threshold functions. In the process, we derive a "maximal principle" for polynomial threshold functions that is used to extend this result further to arbitrary symmetric functions. A surprising coincidence is demonstrated, where the maximal dimension of "holes" in $S_A$ upper bounds the VC dimension of $A$, with equality for common computational cases such as the class of polynomial threshold functions or the class of linear functionals in $\mathbb F_2$, or common algebraic cases such as when the Stanley-Reisner ring of $S_A$ is Cohen-Macaulay. As another interesting application of our theory, we prove a result that a priori has nothing to do with complexity separation: it characterizes when a vector subspace intersects the positive cone, in terms of homological conditions. By analogy to Farkas' result doing the same with *linear conditions*, we call our theorem the Homological Farkas Lemma.
NENov 9, 2016
Lie-Access Neural Turing MachinesGreg Yang, Alexander M. Rush
External neural memory structures have recently become a popular tool for algorithmic deep learning (Graves et al. 2014, Weston et al. 2014). These models generally utilize differentiable versions of traditional discrete memory-access structures (random access, stacks, tapes) to provide the storage necessary for computational tasks. In this work, we argue that these neural memory systems lack specific structure important for relative indexing, and propose an alternative model, Lie-access memory, that is explicitly designed for the neural setting. In this paradigm, memory is accessed using a continuous head in a key-space manifold. The head is moved via Lie group actions, such as shifts or rotations, generated by a controller, and memory access is performed by linear smoothing in key space. We argue that Lie groups provide a natural generalization of discrete memory structures, such as Turing machines, as they provide inverse and identity operators while maintaining differentiability. To experiment with this approach, we implement a simplified Lie-access neural Turing machine (LANTM) with different Lie groups. We find that this approach is able to perform well on a range of algorithmic tasks.
NEFeb 28, 2016
Lie Access Neural Turing MachineGreg Yang
Following the recent trend in explicit neural memory structures, we present a new design of an external memory, wherein memories are stored in an Euclidean key space $\mathbb R^n$. An LSTM controller performs read and write via specialized read and write heads. It can move a head by either providing a new address in the key space (aka random access) or moving from its previous position via a Lie group action (aka Lie access). In this way, the "L" and "R" instructions of a traditional Turing Machine are generalized to arbitrary elements of a fixed Lie group action. For this reason, we name this new model the Lie Access Neural Turing Machine, or LANTM. We tested two different configurations of LANTM against an LSTM baseline in several basic experiments. We found the right configuration of LANTM to outperform the baseline in all of our experiments. In particular, we trained LANTM on addition of $k$-digit numbers for $2 \le k \le 16$, but it was able to generalize almost perfectly to $17 \le k \le 32$, all with the number of parameters 2 orders of magnitude below the LSTM baseline.
LOOct 12, 2014
Computabilities of Validity and Satisfiability in Probability Logics over Finite and Countable ModelsGreg Yang
The $ε$-logic (which is called $ε$E-logic in this paper) of Kuyper and Terwijn is a variant of first order logic with the same syntax, in which the models are equipped with probability measures and in which the $\forall x$ quantifier is interpreted as "there exists a set $A$ of measure $\ge 1 - ε$ such that for each $x \in A$, ...." Previously, Kuyper and Terwijn proved that the general satisfiability and validity problems for this logic are, i) for rational $ε\in (0, 1)$, respectively $Σ^1_1$-complete and $Π^1_1$-hard, and ii) for $ε= 0$, respectively decidable and $Σ^0_1$-complete. The adjective "general" here means "uniformly over all languages." We extend these results in the scenario of finite models. In particular, we show that the problems of satisfiability by and validity over finite models in $ε$E-logic are, i) for rational $ε\in (0, 1)$, respectively $Σ^0_1$- and $Π^0_1$-complete, and ii) for $ε= 0$, respectively decidable and $Π^0_1$-complete. Although partial results toward the countable case are also achieved, the computability of $ε$E-logic over countable models still remains largely unsolved. In addition, most of the results, of this paper and of Kuyper and Terwijn, do not apply to individual languages with a finite number of unary predicates. Reducing this requirement continues to be a major point of research. On the positive side, we derive the decidability of the corresponding problems for monadic relational languages --- equality- and function-free languages with finitely many unary and zero other predicates. This result holds for all three of the unrestricted, the countable, and the finite model cases. Applications in computational learning theory, weighted graphs, and neural networks are discussed in the context of these decidability and undecidability results.