LG, AI | May 20, 2023

A Framework for Provably Stable and Consistent Training of Deep Feedforward Networks

arXiv:2305.12125v1
Originality: Incremental advance
AI Analysis

This work addresses stability and consistency issues in deep learning training, a concern for both researchers and practitioners in machine learning. It appears incremental, since it builds on existing methods such as gradient clipping and squashing activations.

The authors tackle numerical instability and high variance in training deep neural networks. Their algorithm combines stochastic gradient descent with gradient clipping applied only to the output layer, and they prove that a network built entirely from squashing activations is then stable and consistent. They show that this approach removes the need for target networks in Deep Q-Learning, which speeds up learning and reduces memory usage, and that it makes cross-entropy based classification more consistent. Experiments show low-variance updates and a smooth reduction in training loss.
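The update rule summarized above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not say whether clipping is by norm or by value, so clipping by Euclidean norm is assumed here, and the names `clip_by_norm`, `sgd_step`, `output_layer`, and `max_norm` are hypothetical.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale a flat gradient so its Euclidean norm is at most max_norm.

    Assumption: norm clipping. The paper may use a different clipping rule.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm or norm == 0.0:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

def sgd_step(params, grads, lr, output_layer, max_norm):
    """One SGD step in which only the output layer uses a clipped gradient."""
    updated = {}
    for name, weights in params.items():
        grad = grads[name]
        if name == output_layer:
            # Output layer: clipped gradient.
            grad = clip_by_norm(grad, max_norm)
        # All other layers: standard (unclipped) gradient.
        updated[name] = [w - lr * g for w, g in zip(weights, grad)]
    return updated

# Toy example: both layers receive the same large gradient (norm 50).
params = {"hidden": [1.0, 1.0], "output": [1.0, 1.0]}
grads = {"hidden": [30.0, 40.0], "output": [30.0, 40.0]}
new_params = sgd_step(params, grads, lr=0.1, output_layer="output", max_norm=1.0)
```

In this toy run the hidden layer moves by the full gradient while the output layer moves by a gradient rescaled to unit norm. The stability of the hidden layers is meant to follow from the bounded activations, not from clipping them directly.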

We present a novel algorithm for training deep neural networks in supervised (classification and regression) and unsupervised (reinforcement learning) scenarios. This algorithm combines the standard stochastic gradient descent and the gradient clipping method. The output layer is updated using clipped gradients, while the rest of the neural network is updated using standard gradients. Updating the output layer using clipped gradients stabilizes it. We show that the remaining layers are automatically stabilized provided the neural network is only composed of squashing (compact range) activations. We also present a novel squashing activation function, obtained by modifying a Gaussian Error Linear Unit (GELU) to have compact range, which we call Truncated GELU (tGELU). Unlike other squashing activations, such as sigmoid, the range of tGELU can be explicitly specified. As a consequence, the problem of vanishing gradients that arises due to a small range, e.g., in the case of a sigmoid activation, is eliminated. We prove that a NN composed of squashing activations (tGELU, sigmoid, etc.), when updated using the algorithm presented herein, is numerically stable and has consistent performance (low variance). The theory is supported by extensive experiments. Within reinforcement learning, as a consequence of our study, we show that target networks in Deep Q-Learning can be omitted, greatly speeding up learning and alleviating memory requirements. Cross-entropy based classification algorithms that suffer from high variance issues are more consistent when trained using our framework. One symptom of numerical instability in training is the high variance of the neural network update values. We show, in theory and through experiments, that our algorithm updates have low variance, and the training loss reduces in a smooth manner.
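To make the idea of a GELU with an explicitly specified compact range concrete, here is a rough sketch. The abstract does not give the tGELU formula, so this uses the simplest stand-in: clamping the input to a chosen interval before applying the exact GELU. The function name `truncated_gelu` and the thresholds `t_left` and `t_right` are hypothetical, and the paper's actual definition may differ, for example in how it treats the region outside the interval.

```python
import math

def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def truncated_gelu(x, t_left=-3.0, t_right=3.0):
    """Illustrative compact-range GELU (an assumption, not the paper's formula).

    Inputs are clamped to [t_left, t_right], so the output set is bounded
    and its extent is controlled by the two thresholds.
    """
    clamped = max(t_left, min(x, t_right))
    return gelu(clamped)

# Inside the interval the function matches GELU; outside, it saturates.
inside = truncated_gelu(1.0)
saturated_high = truncated_gelu(100.0)
saturated_low = truncated_gelu(-100.0)
```

Widening the interval widens the output range, which is the property the abstract contrasts with the fixed range of a sigmoid.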

Code Implementations: 1 repo
