Can Local Representation Alignment RNNs Solve Temporal Tasks?
This work targets training instabilities in RNNs, which matter in fields such as natural language processing and robotics. The contribution is incremental: it modifies the existing LRA method to improve gradient flow.
The paper studies training recurrent neural networks (RNNs) with local representation alignment (LRA) as a way to avoid instabilities such as vanishing gradients. It finds that LRA RNNs still suffer from vanishing gradients and are difficult to train. Adding gradient regularization improves convergence: the regularized LRA RNN considerably outperforms the unregularized version on three temporal tasks (temporal order, 3-bit temporal order, and random permutation).
Recurrent Neural Networks (RNNs) are commonly used for real-time processing, streaming data, and cases where the number of training samples is limited. Backpropagation Through Time (BPTT) is the predominant algorithm for training RNNs; however, it is frequently criticized for being prone to exploding and vanishing gradients and for being biologically implausible. In this paper, we present and evaluate a target propagation-based method for RNNs, which uses local updates and seeks to reduce these instabilities. Stable RNN models are more practical to use in a wide range of fields such as natural language processing, time-series forecasting, anomaly detection, control systems, and robotics. The proposed solution uses local representation alignment (LRA). We thoroughly analyze the performance of this method, experiment with normalization and different local error functions, and invalidate certain assumptions about the behavior of this type of learning. Namely, we demonstrate that despite the decomposition of the network into sub-graphs, the model still suffers from vanishing gradients. We also show that gradient clipping as proposed in LRA has little to no effect on network performance. The result is an LRA RNN model that is very difficult to train due to vanishing gradients. We address this by introducing gradient regularization in the direction of the update and demonstrate that this modification promotes gradient flow and meaningfully improves convergence. We compare and discuss the performance of the algorithm, and we show that the regularized LRA RNN considerably outperforms the unregularized version on three landmark tasks: temporal order, 3-bit temporal order, and random permutation.
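To make the vanishing-gradient problem named in the abstract concrete, the following is a minimal NumPy sketch of a plain tanh RNN trained conceptually with BPTT. It is not the paper's LRA method or its gradient regularization. The function names `backprop_norms` and `clip_by_norm`, the network sizes, and the weight scale are all assumptions chosen for illustration. The sketch tracks how the norm of an error signal shrinks as it is pushed back through time, and shows that generic norm clipping only rescales gradients that are too large, so it leaves an already tiny gradient unchanged. This is one generic reason clipping cannot repair vanishing gradients, and it is not necessarily the explanation given in the paper.

```python
import numpy as np

def backprop_norms(T=50, hidden=32, scale=0.5, seed=0):
    """Return the norm of dL/dh_t while an error signal is pushed back T steps.

    Illustrative vanilla tanh RNN (an assumption, not the paper's model):
        a_t = W h_{t-1} + x_t,   h_t = tanh(a_t)
    """
    rng = np.random.default_rng(seed)
    # Recurrent weights; `scale` roughly controls how contractive the map is.
    W = scale * rng.standard_normal((hidden, hidden)) / np.sqrt(hidden)

    # Forward pass: store pre-activations for the backward pass.
    h = np.zeros(hidden)
    pre_activations = []
    for _ in range(T):
        x = rng.standard_normal(hidden)
        a = W @ h + x
        h = np.tanh(a)
        pre_activations.append(a)

    # Backward pass: start from a unit-norm gradient at the last step.
    g = np.ones(hidden) / np.sqrt(hidden)
    norms = [np.linalg.norm(g)]
    for a in reversed(pre_activations):
        # dL/dh_{t-1} = W^T (dL/dh_t * tanh'(a_t))
        g = W.T @ (g * (1.0 - np.tanh(a) ** 2))
        norms.append(np.linalg.norm(g))
    return np.array(norms)

def clip_by_norm(g, max_norm=1.0):
    """Generic gradient-norm clipping: rescale only if the norm exceeds max_norm."""
    n = np.linalg.norm(g)
    return g if n <= max_norm else g * (max_norm / n)

if __name__ == "__main__":
    norms = backprop_norms()
    print("gradient norm at the last step:", norms[0])
    print("gradient norm after 50 steps back:", norms[-1])
    tiny = np.full(4, 1e-12)
    print("clipping changes a tiny gradient:", not np.allclose(clip_by_norm(tiny), tiny))
```

With these assumed settings the recurrent map is contractive, so the printed norm after 50 steps is many orders of magnitude smaller than at the last step. Raising `scale` well above 1 gives the opposite regime, where norms can grow and clipping becomes relevant.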