Convergence of Gradient Descent for Recurrent Neural Networks: A Nonasymptotic Analysis
This work provides theoretical guarantees for efficient RNN training, addressing a bottleneck in sequence modeling, though the contribution is incremental in that it builds on existing analysis frameworks.
The paper studies gradient-descent training of recurrent neural networks (RNNs) and proves that optimality can be achieved without massive overparameterization: the required network size scales only logarithmically with the sample size, whereas prior analyses require a polynomial dependency.
We analyze recurrent neural networks with diagonal hidden-to-hidden weight matrices, trained with gradient descent in the supervised learning setting, and prove that gradient descent can achieve optimality \emph{without} massive overparameterization. Our in-depth nonasymptotic analysis (i) provides improved bounds on the network size $m$ in terms of the sequence length $T$, sample size $n$ and ambient dimension $d$, and (ii) identifies the significant impact of long-term dependencies in the dynamical system on the convergence and network width bounds, characterized by a cutoff point that depends on the Lipschitz continuity of the activation function. Remarkably, this analysis reveals that an appropriately initialized recurrent neural network trained with $n$ samples can achieve optimality with a network size $m$ that scales only logarithmically with $n$. This sharply contrasts with prior works, which require a high-order polynomial dependency of $m$ on $n$ to establish strong regularity conditions. Our results are based on an explicit characterization of the class of dynamical systems that can be approximated and learned by recurrent neural networks via norm-constrained transportation mappings, and on establishing local smoothness properties of the hidden state with respect to the learnable parameters.
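To make the setting concrete, the following is a minimal sketch of a recurrent network with a diagonal hidden-to-hidden matrix trained by full-batch gradient descent. It is an illustration under stated assumptions, not the construction analyzed in the paper: the tanh activation, the $1/\sqrt{m}$ output scaling, the paired symmetric initialization, the toy teacher system in make_data, and every function name are hypothetical choices made for this example. Because the recurrent matrix is diagonal, each hidden unit evolves and backpropagates independently of the others.

```python
import math
import random


def init_params(m, d, rng, w_max=0.5):
    """Paired symmetric initialization (an assumption of this sketch): the two
    halves share weights and carry opposite output signs, so the output starts at 0."""
    half = m // 2  # m is assumed to be even
    w0 = [rng.uniform(-w_max, w_max) for _ in range(half)]
    U0 = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(half)]
    w = w0 + w0[:]                                        # diagonal of the recurrent matrix
    U = [row[:] for row in U0] + [row[:] for row in U0]   # input-to-hidden weights
    c = [1.0] * half + [-1.0] * half                      # fixed output weights
    return w, U, c


def forward(w, U, c, x):
    """Run h_t = tanh(w * h_{t-1} + U x_t) from h_0 = 0; return output and all states."""
    m = len(w)
    h = [0.0] * m
    hs = [h]
    for x_t in x:
        h = [math.tanh(w[i] * h[i] + sum(U[i][k] * x_t[k] for k in range(len(x_t))))
             for i in range(m)]
        hs.append(h)
    y = sum(c[i] * h[i] for i in range(m)) / math.sqrt(m)
    return y, hs


def loss_and_grads(w, U, c, data):
    """Mean squared loss on the final output and its gradients via backpropagation
    through time. With a diagonal recurrence, each unit is handled separately."""
    m, d, n = len(w), len(U[0]), len(data)
    gw = [0.0] * m
    gU = [[0.0] * d for _ in range(m)]
    loss = 0.0
    for x, target in data:
        y, hs = forward(w, U, c, x)
        err = y - target
        loss += 0.5 * err * err / n
        for i in range(m):
            dh = err * c[i] / (math.sqrt(m) * n)          # dLoss / dh_T for unit i
            for t in reversed(range(len(x))):
                dz = dh * (1.0 - hs[t + 1][i] ** 2)       # through tanh at step t
                gw[i] += dz * hs[t][i]
                for k in range(d):
                    gU[i][k] += dz * x[t][k]
                dh = dz * w[i]                            # pass to the previous state
    return loss, gw, gU


def train(w, U, c, data, lr=0.05, steps=200):
    """Plain full-batch gradient descent on w and U; c stays fixed."""
    losses = []
    for _ in range(steps):
        loss, gw, gU = loss_and_grads(w, U, c, data)
        losses.append(loss)
        w = [wi - lr * g for wi, g in zip(w, gw)]
        U = [[u - lr * g for u, g in zip(Ui, gUi)] for Ui, gUi in zip(U, gU)]
    return w, U, losses


def make_data(n, T, d, rng):
    """Toy teacher (hypothetical): s_t = 0.5 * s_{t-1} + x_t[0], target = tanh(s_T).
    Inputs are normalized to unit Euclidean norm at every time step."""
    data = []
    for _ in range(n):
        x, s = [], 0.0
        for _t in range(T):
            v = [rng.gauss(0.0, 1.0) for _ in range(d)]
            norm = math.sqrt(sum(a * a for a in v))
            v = [a / norm for a in v]
            s = 0.5 * s + v[0]
            x.append(v)
        data.append((x, math.tanh(s)))
    return data


if __name__ == "__main__":
    rng = random.Random(0)
    data = make_data(n=20, T=5, d=2, rng=rng)
    w, U, c = init_params(m=16, d=2, rng=rng)
    _, _, losses = train(w, U, c, data)
    print("initial loss:", losses[0], "final loss:", losses[-1])
```

Varying m, n and T in this sketch gives a rough empirical feel for how width, sample size and sequence length interact, but it does not reproduce or verify the bounds stated in the abstract.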