What is Holding Back Convnets for Detection?
This work identifies critical bottlenecks in convnet-based object detection and offers researchers and practitioners insights for improving model robustness, although its contribution is incremental because it analyzes existing methods.
The paper investigates the limitations of convolutional neural networks in object detection, finding that state-of-the-art architectures lack invariance to appearance factors and that architectural changes, not just more data, are needed to address these weaknesses. It reports improved performance on Pascal3D+ detection and view-point estimation tasks through data augmentation with image renderings.
Convolutional neural networks have recently shown excellent results in general object detection and many other tasks. Albeit very effective, they involve many user-defined design choices. In this paper we want to better understand these choices by inspecting two key aspects: "what did the network learn?" and "what can the network learn?". We exploit new annotations (Pascal3D+) to enable a new empirical analysis of the R-CNN detector. Contrary to common belief, our results indicate that existing state-of-the-art convnet architectures are not invariant to various appearance factors. In fact, all considered networks have similar weak points, which cannot be mitigated by simply increasing the training data (architectural changes are needed). We show that overall performance can improve when using image renderings for data augmentation. We report the best known results on the Pascal3D+ detection and view-point estimation tasks.
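To make the "what did the network learn?" question concrete, the following is a minimal sketch of one way to probe invariance to a single appearance factor. It assumes each ground-truth object carries an azimuth annotation (as Pascal3D+-style annotations could provide) and a flag saying whether the detector found it. The function name `recall_by_azimuth_bin`, the use of per-bin recall as the metric, and the toy data are illustrative assumptions, not the paper's evaluation protocol.

```python
from collections import defaultdict


def recall_by_azimuth_bin(annotations, num_bins=8):
    """Compute detection recall separately for each azimuth bin.

    `annotations` is an iterable of (azimuth_degrees, was_detected) pairs.
    Hypothetical helper: returns a dict mapping bin index -> recall in [0, 1].
    Bins that contain no objects are omitted from the result.
    """
    bin_width = 360.0 / num_bins
    hits = defaultdict(int)
    totals = defaultdict(int)
    for azimuth, detected in annotations:
        # Wrap the angle into [0, 360) and clamp to guard against
        # floating-point edge cases at the upper boundary.
        bin_index = min(int((azimuth % 360.0) // bin_width), num_bins - 1)
        totals[bin_index] += 1
        hits[bin_index] += int(bool(detected))
    return {b: hits[b] / totals[b] for b in sorted(totals)}


# Toy data (made up for illustration): (azimuth in degrees, detected?)
toy = [(10, True), (20, True), (95, False), (100, True), (200, False), (370, True)]
recalls = recall_by_azimuth_bin(toy, num_bins=4)
print(recalls)
```

A detector that were truly invariant to viewpoint would show roughly equal values across bins; large gaps between bins point to the kind of weak spots the abstract describes.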