LGJun 20, 2019
Beyond exploding and vanishing gradients: analysing RNN training using attractors and smoothnessAntônio H. Ribeiro, Koen Tiels, Luis A. Aguirre et al.
The exploding and vanishing gradient problem has been the major conceptual principle behind most architecture and training improvements in recurrent neural networks (RNNs) during the last decade. In this paper, we argue that this principle, while powerful, might need some refinement to explain recent developments. We refine the concept of exploding gradients by reformulating the problem in terms of the cost function smoothness, which gives insight into higher-order derivatives and the existence of regions with many close local minima. We also clarify the distinction between vanishing gradients and the need for the RNN to learn attractors to fully use its expressive power. Through the lens of these refinements, we shed new light on recent developments in the RNN field, namely stable RNN and unitary (or orthogonal) RNNs.
SYMay 2, 2019
On the smoothness of nonlinear system identificationAntônio H. Ribeiro, Koen Tiels, Jack Umenberger et al.
We shed new light on the \textit{smoothness} of optimization problems arising in prediction error parameter estimation of linear and nonlinear systems. We show that for regions of the parameter space where the model is not contractive, the Lipschitz constant and $β$-smoothness of the objective function might blow up exponentially with the simulation length, making it hard to numerically find minima within those regions or, even, to escape from them. In addition to providing theoretical understanding of this problem, this paper also proposes the use of multiple shooting as a viable solution. The proposed method minimizes the error between a prediction model and the observed values. Rather than running the prediction model over the entire dataset, multiple shooting splits the data into smaller subsets and runs the prediction model over each subset, making the simulation length a design parameter and making it possible to solve problems that would be infeasible using a standard approach. The equivalence to the original problem is obtained by including constraints in the optimization. The new method is illustrated by estimating the parameters of nonlinear systems with chaotic or unstable behavior, as well as neural networks. We also present a comparative analysis of the proposed method with multi-step-ahead prediction error minimization.
SYOct 2, 2017
Lasso Regularization Paths for NARMAX Models via Coordinate DescentAntônio H. Ribeiro, Luis A. Aguirre
We propose a new algorithm for estimating NARMAX models with $L_1$ regularization for models represented as a linear combination of basis functions. Due to the $L_1$-norm penalty the Lasso estimation tends to produce some coefficients that are exactly zero and hence gives interpretable models. The novelty of the contribution is the inclusion of error regressors in the Lasso estimation (which yields a nonlinear regression problem). The proposed algorithm uses cyclical coordinate descent to compute the parameters of the NARMAX models for the entire regularization path. It deals with the error terms by updating the regressor matrix along with the parameter vector. In comparative timings we find that the modification does not reduce the computational efficiency of the original algorithm and can provide the most important regressors in very few inexpensive iterations. The method is illustrated for linear and polynomial models by means of two examples.
SYJun 21, 2017
"Parallel Training Considered Harmful?": Comparing series-parallel and parallel feedforward network trainingAntônio H. Ribeiro, Luis A. Aguirre
Neural network models for dynamic systems can be trained either in parallel or in series-parallel configurations. Influenced by early arguments, several papers justify the choice of series-parallel rather than parallel configuration claiming it has a lower computational cost, better stability properties during training and provides more accurate results. Other published results, on the other hand, defend parallel training as being more robust and capable of yielding more accu- rate long-term predictions. The main contribution of this paper is to present a study comparing both methods under the same unified framework. We focus on three aspects: i) robustness of the estimation in the presence of noise; ii) computational cost; and, iii) convergence. A unifying mathematical framework and simulation studies show situations where each training method provides better validation results, being parallel training better in what is believed to be more realistic scenarios. An example using measured data seems to reinforce such claim. We also show, with a novel complexity analysis and numerical examples, that both methods have similar computational cost, being series series-parallel training, however, more amenable to parallelization. Some informal discussion about stability and convergence properties is presented and explored in the examples.