Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet
This work provides a more interpretable model for image classification, which may benefit researchers and practitioners in explainable AI, though the contribution is incremental because it builds on existing bag-of-feature methods.
The authors tackled the problem of understanding deep neural network decisions by introducing BagNet, a variant of ResNet-50 that classifies images based on local features without spatial ordering, achieving 87.6% top-5 accuracy on ImageNet with 33x33 pixel features and AlexNet-level performance with 17x17 pixel features.
Deep Neural Networks (DNNs) excel on many complex perceptual tasks, but it has proven notoriously difficult to understand how they reach their decisions. Here we introduce a high-performance DNN architecture on ImageNet whose decisions are considerably easier to explain. Our model, a simple variant of the ResNet-50 architecture called BagNet, classifies an image based on the occurrences of small local image features without taking into account their spatial ordering. This strategy is closely related to the bag-of-feature (BoF) models popular before the onset of deep learning and reaches a surprisingly high accuracy on ImageNet (87.6% top-5 for 33 x 33 px features and AlexNet performance for 17 x 17 px features). The constraint on local features makes it straightforward to analyse how exactly each part of the image influences the classification. Furthermore, the BagNets behave similarly to state-of-the-art deep neural networks such as VGG-16, ResNet-152 or DenseNet-169 in terms of feature sensitivity, error distribution and interactions between image parts. This suggests that the improvements of DNNs over previous bag-of-feature classifiers in the last few years are mostly achieved by better fine-tuning rather than by qualitatively different decision strategies.
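To make the "occurrences without spatial ordering" idea concrete, here is a minimal sketch of a bag-of-local-features classifier in plain Python. It is not the BagNet implementation. It assumes that each small patch is scored independently and that the per-patch class evidence is averaged over all patches. The deep patch network is replaced by a hypothetical random linear map (`patch_logits`), and all names, sizes and weights are illustrative only.

```python
import random


def extract_patches(image, q, stride=1):
    """Return (top, left, flat_pixels) for every q x q patch of a 2D list image."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - q + 1, stride):
        for left in range(0, width - q + 1, stride):
            flat = [image[top + i][left + j] for i in range(q) for j in range(q)]
            patches.append((top, left, flat))
    return patches


def patch_logits(flat, weights):
    """Hypothetical stand-in for the patch network: one linear score per class."""
    return [sum(w * x for w, x in zip(row, flat)) for row in weights]


def mean_logits(per_patch, n_classes):
    """Average the class evidence over all patches, ignoring where they came from."""
    return [sum(p[2][c] for p in per_patch) / len(per_patch) for c in range(n_classes)]


def bag_of_local_features(image, weights, q, stride=1):
    """Return image-level logits plus per-patch logits (usable as a heatmap)."""
    per_patch = [
        (top, left, patch_logits(flat, weights))
        for top, left, flat in extract_patches(image, q, stride)
    ]
    return mean_logits(per_patch, len(weights)), per_patch


random.seed(0)
Q, N_CLASSES = 3, 4  # toy patch size and class count, not the paper's values
image = [[random.random() for _ in range(8)] for _ in range(8)]
weights = [[random.gauss(0, 1) for _ in range(Q * Q)] for _ in range(N_CLASSES)]

image_logits, per_patch = bag_of_local_features(image, weights, Q)
predicted_class = max(range(N_CLASSES), key=lambda c: image_logits[c])

# Shuffling the patch evidence leaves the decision unchanged (up to float
# rounding), because the average does not depend on spatial order.
shuffled = per_patch[:]
random.shuffle(shuffled)
shuffled_logits = mean_logits(shuffled, N_CLASSES)
```

Because the image-level score is just an average of per-patch scores, each entry of `per_patch` states directly how much the patch at `(top, left)` contributed to each class, which is the property that makes this kind of model easy to inspect.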