ML Nov 25, 2022
Minimal Width for Universal Property of Deep RNN
Chang hoon Song, Geonho Hwang, Jun ho Lee et al.
A recurrent neural network (RNN) is a widely used deep-learning network for dealing with sequential data. Imitating a dynamical system, an infinite-width RNN can approximate any open dynamical system in a compact domain. In general, deep networks with bounded widths are more effective than wide networks in practice; however, the universal approximation theorem for deep narrow structures has yet to be extensively studied. In this study, we prove the universality of deep narrow RNNs and show that the upper bound of the minimum width for universality can be independent of the length of the data. Specifically, we show that a deep RNN with ReLU activation can approximate any continuous function or $L^p$ function with the widths $d_x+d_y+2$ and $\max\{d_x+1,d_y\}$, respectively, where the target function maps a finite sequence of vectors in $\mathbb{R}^{d_x}$ to a finite sequence of vectors in $\mathbb{R}^{d_y}$. We also compute the additional width required if the activation function is $\tanh$ or more. In addition, we prove the universality of other recurrent networks, such as bidirectional RNNs. Bridging a multi-layer perceptron and an RNN, our theory and proof technique can be an initial step toward further research on deep RNNs.
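To make the width bound concrete, here is a minimal sketch of a stacked ReLU RNN in which every layer has width $d_x+d_y+2$, the bound the abstract gives for continuous targets. It assumes the standard recurrence $h_t = \mathrm{ReLU}(W_{in} x_t + W_{rec} h_{t-1} + b)$ and a final linear readout, which may differ from the paper's exact definition. The weights are random placeholders rather than the construction used in the proof, and all function names are hypothetical.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def rand_matrix(rows, cols, rng):
    # Placeholder weights; the paper's proof constructs specific weights instead.
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def rnn_layer(seq, w_in, w_rec, bias):
    """One recurrent layer: h_t = ReLU(w_in x_t + w_rec h_{t-1} + bias)."""
    h = [0.0] * len(bias)
    out = []
    for x in seq:
        pre = [a + b + c
               for a, b, c in zip(matvec(w_in, x), matvec(w_rec, h), bias)]
        h = relu(pre)
        out.append(h)
    return out

def deep_narrow_rnn(seq, d_x, d_y, depth, seed=0):
    """Stack `depth` recurrent layers of width d_x + d_y + 2 (assumed layout).

    Returns the output sequence and the total number of parameters.
    """
    rng = random.Random(seed)
    width = d_x + d_y + 2  # width bound quoted in the abstract for ReLU
    n_params = 0
    in_dim = d_x
    hidden = seq
    for _ in range(depth):
        w_in = rand_matrix(width, in_dim, rng)
        w_rec = rand_matrix(width, width, rng)
        bias = [rng.uniform(-0.5, 0.5) for _ in range(width)]
        hidden = rnn_layer(hidden, w_in, w_rec, bias)
        n_params += width * in_dim + width * width + width
        in_dim = width
    w_out = rand_matrix(d_y, width, rng)  # linear readout to R^{d_y}
    n_params += d_y * width
    return [matvec(w_out, h) for h in hidden], n_params
```

Calling `deep_narrow_rnn` on sequences of length 5 and length 50 with the same `d_x`, `d_y`, and `depth` yields the same parameter count, which illustrates a layer width that does not depend on the sequence length. The sketch does not show how large `depth` must be for a given approximation error.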
CV Jul 21, 2022
Learning from Data with Noisy Labels Using Temporal Self-Ensemble
Jun Ho Lee, Jae Soon Baik, Tae Hwan Hwang et al.
Real-world datasets inevitably contain many mislabeled samples. Because deep neural networks (DNNs) have an enormous capacity to memorize noisy labels, a robust training scheme is required to prevent labeling errors from degrading the generalization performance of DNNs. Current state-of-the-art methods use a co-training scheme that trains dual networks on samples associated with small losses. In practice, however, training two networks simultaneously can burden computing resources. In this study, we propose a simple yet effective robust training scheme that trains only a single network. During training, the proposed method generates a temporal self-ensemble by sampling intermediate network parameters from the weight trajectory formed by stochastic gradient descent optimization. The loss sum evaluated with these self-ensembles is used to identify incorrectly labeled samples. In parallel, our method generates multi-view predictions by transforming each input into various forms and considers their agreement to identify incorrectly labeled samples. By combining these two metrics, we present the self-ensemble-based robust training (SRT) method, which filters samples with noisy labels to reduce their influence on training. Experiments on widely used public datasets demonstrate that the proposed method achieves state-of-the-art performance in some categories without training dual networks.
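The two filtering signals described in this abstract can be sketched on toy data as follows. This is only an illustration under stated assumptions: the abstract does not say how the loss sum and the multi-view agreement are combined, so the thresholded AND rule, the thresholds themselves, and all function names here are hypothetical. Predicted class probabilities are supplied directly instead of coming from a trained network.

```python
import math

def cross_entropy(probs, label):
    # Clamp to avoid log(0) on degenerate predictions.
    return -math.log(max(probs[label], 1e-12))

def ensemble_loss_sum(checkpoint_probs, label):
    """Sum one sample's loss over checkpoints taken along the SGD trajectory."""
    return sum(cross_entropy(p, label) for p in checkpoint_probs)

def view_agreement(view_probs, label):
    """Fraction of transformed views whose top prediction matches the label."""
    hits = sum(
        1 for p in view_probs
        if max(range(len(p)), key=p.__getitem__) == label
    )
    return hits / len(view_probs)

def select_clean(samples, loss_threshold, agreement_threshold):
    """Keep indices of samples that look correctly labeled.

    Assumed rule (not from the source): small loss sum AND high agreement.
    """
    clean = []
    for idx, s in enumerate(samples):
        loss = ensemble_loss_sum(s["checkpoints"], s["label"])
        agree = view_agreement(s["views"], s["label"])
        if loss <= loss_threshold and agree >= agreement_threshold:
            clean.append(idx)
    return clean

# Toy data: sample 0 is consistent with its label, sample 1 is not.
samples = [
    {"label": 0,
     "checkpoints": [[0.9, 0.1], [0.8, 0.2]],
     "views": [[0.9, 0.1], [0.7, 0.3]]},
    {"label": 1,
     "checkpoints": [[0.9, 0.1], [0.85, 0.15]],
     "views": [[0.9, 0.1], [0.6, 0.4]]},
]
kept = select_clean(samples, loss_threshold=1.0, agreement_threshold=0.5)
```

With these toy values, only sample 0 passes both checks, so a training step would use it and down-weight or skip sample 1.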