Quantifying Translation-Invariance in Convolutional Neural Networks
This work addresses a fundamental issue in object recognition that matters to both researchers and practitioners and offers insight into model design, although the contribution is incremental because it builds on existing hypotheses.
The paper tackles the problem of understanding translation invariance in CNNs by developing a tool to quantify it, and finds that architectural choices have only a secondary effect while training data augmentation is the key factor.
A fundamental problem in object recognition is the development of image representations that are invariant to common transformations such as translation, rotation, and small deformations. There are multiple hypotheses regarding the source of translation invariance in CNNs. One idea is that translation invariance is due to the increasing receptive field size of neurons in successive convolution layers. Another possibility is that invariance is due to the pooling operation. We develop a simple tool, the translation-sensitivity map, which we use to visualize and quantify the translation-invariance of various architectures. We obtain the surprising result that architectural choices such as the number of pooling layers and the convolution filter size have only a secondary effect on the translation-invariance of a network. Our analysis identifies training data augmentation as the most important factor in obtaining translation-invariant representations of images using convolutional neural networks.
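The abstract does not spell out how a translation-sensitivity map is computed, so the following is only a minimal sketch of one plausible reading: for every shift (dx, dy) in a small window, record the Euclidean distance between the feature vector of the original image and that of the shifted image. The names `translate`, `sensitivity_map`, `flatten_features`, and `pooled_features` are hypothetical, and the two toy feature extractors stand in for a trained network, which is an assumption made purely for illustration.

```python
import numpy as np


def translate(image, dx, dy):
    """Shift a 2-D image by (dx, dy) pixels, filling vacated pixels with zeros."""
    h, w = image.shape
    out = np.zeros_like(image)
    # Source and destination windows that overlap after the shift.
    src_x, dst_x = slice(max(0, -dx), min(w, w - dx)), slice(max(0, dx), min(w, w + dx))
    src_y, dst_y = slice(max(0, -dy), min(h, h - dy)), slice(max(0, dy), min(h, h + dy))
    out[dst_y, dst_x] = image[src_y, src_x]
    return out


def sensitivity_map(features, image, max_shift):
    """Distance between features of the base image and each shifted copy.

    Assumption: Euclidean distance in feature space is the sensitivity measure.
    Entry [i, j] corresponds to dy = i - max_shift, dx = j - max_shift.
    """
    base = features(image)
    shifts = range(-max_shift, max_shift + 1)
    smap = np.zeros((len(shifts), len(shifts)))
    for i, dy in enumerate(shifts):
        for j, dx in enumerate(shifts):
            shifted = features(translate(image, dx, dy))
            smap[i, j] = np.linalg.norm(shifted - base)
    return smap


# Toy stand-ins for a network (hypothetical, not the models studied in the paper).
def flatten_features(image):
    """No pooling at all: every pixel position matters."""
    return image.ravel()


def pooled_features(image):
    """Global sum pooling: position is discarded entirely."""
    return np.array([image.sum()])


# A 16x16 image with a 4x4 blob in the centre; shifts of up to 3 pixels keep it in frame.
image = np.zeros((16, 16))
image[6:10, 6:10] = 1.0

flat_map = sensitivity_map(flatten_features, image, max_shift=3)
pooled_map = sensitivity_map(pooled_features, image, max_shift=3)
```

In this toy setting `pooled_map` is flat because global pooling discards position, while `flat_map` is zero only at the centre (no shift) and grows as the blob moves away. With a real network, `features` would return the activations of a chosen layer, and a flatter map would indicate a more translation-invariant representation.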