Stride and Translation Invariance in CNNs
This work addresses a fundamental limitation in CNNs for image classification, offering insights into dataset-specific design choices, though it is incremental in nature.
The paper tackles the lack of translation invariance in Convolutional Neural Networks (CNNs) by analyzing stride and local homogeneity. It finds that combining stride with a suitable pooling kernel size can improve invariance, but that larger pooling kernels trade generalization for invariance.
Convolutional Neural Networks have become the standard for image classification tasks; however, these architectures are not invariant to translations of the input image. This lack of invariance is attributed to the use of stride, which ignores the sampling theorem, and to fully connected layers, which lack spatial reasoning. We show that stride can greatly benefit translation invariance provided that it is combined with sufficient similarity between neighbouring pixels, a characteristic which we refer to as local homogeneity. We also observe that this characteristic is dataset-specific and dictates the relationship between pooling kernel size and stride required for translation invariance. Furthermore, we find that a trade-off exists between generalization and translation invariance in the case of pooling kernel size, as larger kernel sizes lead to better invariance but poorer generalization. Finally, we explore the efficacy of other proposed solutions, namely global average pooling, anti-aliasing, and data augmentation, both empirically and through the lens of local homogeneity.
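The interaction between stride, pooling kernel size, and similarity between neighbouring pixels can be illustrated with a toy 1D example. The following is a minimal sketch, not the paper's experimental setup: the helper names (`max_pool_1d`, `neighbour_gap`, `shift_error`) are hypothetical, and using the largest difference between adjacent samples as a proxy for (lack of) local homogeneity is an assumption made here for illustration only.

```python
import numpy as np

def max_pool_1d(x, kernel, stride):
    """Max-pool a 1D signal with the given kernel size and stride (no padding)."""
    starts = range(0, len(x) - kernel + 1, stride)
    return np.array([x[s:s + kernel].max() for s in starts])

def neighbour_gap(x):
    """Largest absolute difference between circularly adjacent samples.

    Assumed proxy: a small gap means the signal is locally homogeneous.
    """
    return np.abs(x - np.roll(x, 1)).max()

def shift_error(x, kernel, stride):
    """Largest change in the pooled output after a one-sample circular shift."""
    pooled = max_pool_1d(x, kernel, stride)
    pooled_shifted = max_pool_1d(np.roll(x, 1), kernel, stride)
    return np.abs(pooled - pooled_shifted).max()

n = 64
smooth = np.sin(2 * np.pi * np.arange(n) / n)  # neighbouring samples are similar
checker = np.arange(n) % 2.0                   # neighbouring samples differ by 1

for name, signal in [("smooth", smooth), ("checker", checker)]:
    for kernel in (1, 2):
        err = shift_error(signal, kernel, stride=2)
        print(f"{name:8s} kernel={kernel} stride=2 "
              f"gap={neighbour_gap(signal):.3f} shift_error={err:.3f}")
```

In this sketch, subsampling the alternating signal with stride 2 and kernel size 1 changes the output by 1.0 after a one-sample shift, while a kernel size of 2 leaves it unchanged; the smooth signal changes by less than 0.1 in both settings. More generally, for max pooling the change under a one-sample circular shift cannot exceed the largest gap between adjacent samples, which is one simple way to see why similarity between neighbouring pixels limits the damage done by stride.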