Recurrent Spatial Transformer Networks
This work aims to improve accuracy in sequential image classification for computer vision applications. The contribution is incremental, since it builds on existing spatial transformer and recurrent network components.
The paper tackles digit classification in cluttered MNIST sequences by integrating spatial transformer networks into recurrent neural networks, achieving a single digit error of 1.5%, which is lower than that of convolutional networks (2.9%) and convolutional networks with SPN layers (2.0%).
We integrate the recently proposed spatial transformer network (SPN) [Jaderberg et al., 2015] into a recurrent neural network (RNN) to form an RNN-SPN model. We use the RNN-SPN to classify digits in cluttered MNIST sequences. The proposed model achieves a single digit error of 1.5%, compared to 2.9% for convolutional networks and 2.0% for convolutional networks with SPN layers. The SPN outputs a zoomed, rotated and skewed version of the input image. We investigate different down-sampling factors (the ratio of pixels in the input to pixels in the output) for the SPN and show that the RNN-SPN model is able to down-sample the input images without deteriorating performance. The down-sampling in RNN-SPN can be thought of as adaptive down-sampling that minimizes the information loss in the regions of interest. We attribute the superior performance of the RNN-SPN to the fact that it can attend to a sequence of regions of interest.
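
To make the sampling step concrete, the following is a minimal NumPy sketch of how a spatial transformer can produce a zoomed and down-sampled view of an image from a 2x3 affine matrix. It is an illustration under stated assumptions, not the authors' implementation: the function names (`affine_grid`, `bilinear_sample`, `spatial_transform`), the `downsample` parameter, and the hand-picked `theta` values are all hypothetical. In the RNN-SPN described above, the affine parameters would instead be predicted by the network, one set per step of the sequence.

```python
import numpy as np

def affine_grid(theta, out_h, out_w):
    """Map a regular output grid in [-1, 1]^2 to source coordinates
    using a 2x3 affine matrix theta (zoom, rotation, skew, translation)."""
    ys = np.linspace(-1.0, 1.0, out_h)
    xs = np.linspace(-1.0, 1.0, out_w)
    gx, gy = np.meshgrid(xs, ys)                      # each (out_h, out_w)
    target = np.stack([gx.ravel(), gy.ravel(), np.ones(gx.size)])  # (3, N)
    src = theta @ target                              # (2, N)
    return src[0].reshape(out_h, out_w), src[1].reshape(out_h, out_w)

def bilinear_sample(img, sx, sy):
    """Bilinearly interpolate img at normalized coordinates (sx, sy).
    Samples that fall outside the image contribute zero."""
    h, w = img.shape
    x = (sx + 1.0) * (w - 1) / 2.0                    # normalized -> pixel units
    y = (sy + 1.0) * (h - 1) / 2.0
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0

    def pixel(yy, xx):
        valid = (yy >= 0) & (yy < h) & (xx >= 0) & (xx < w)
        vals = img[np.clip(yy, 0, h - 1), np.clip(xx, 0, w - 1)]
        return np.where(valid, vals, 0.0)

    return ((1 - wy) * (1 - wx) * pixel(y0, x0)
            + (1 - wy) * wx * pixel(y0, x1)
            + wy * (1 - wx) * pixel(y1, x0)
            + wy * wx * pixel(y1, x1))

def spatial_transform(img, theta, downsample=1):
    """Sample an output of size (H // downsample, W // downsample).
    `downsample` is an assumed name for the down-sampling factor."""
    out_h = img.shape[0] // downsample
    out_w = img.shape[1] // downsample
    sx, sy = affine_grid(theta, out_h, out_w)
    return bilinear_sample(img, sx, sy)

# Illustrative usage with made-up values.
img = np.arange(64, dtype=float).reshape(8, 8)
identity = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
zoom_top_left = np.array([[0.5, 0.0, -0.5],          # zoom 2x, shift toward
                          [0.0, 0.5, -0.5]])          # the top-left quadrant
full_view = spatial_transform(img, identity, downsample=2)      # shape (4, 4)
crop_view = spatial_transform(img, zoom_top_left, downsample=2) # shape (4, 4)
```

With the identity matrix, a down-sampling factor of 2 spreads the 4x4 output samples over the whole image. With the zooming matrix, the same number of output pixels covers only one quadrant, so that region is sampled more densely. This is one way to read the "adaptive down-sampling" described in the abstract: the output resolution is fixed, and the transformation decides where those pixels are spent.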