CV · Apr 11, 2019

An Analysis of Pre-Training on Object Detection

Hengduo Li, Bharat Singh, Mahyar Najibi, Zuxuan Wu, Larry S. Davis

arXiv:1904.05871v114.441 citations

Originality Synthesis-oriented

AI Analysis

This work provides empirical insights for researchers and practitioners in computer vision, showing the trade-offs of detection pre-training, but it is incremental as it builds on existing pre-training paradigms without introducing new methods.

The study analyzes convolutional neural networks pre-trained on large object detection datasets. Such pre-training substantially improves fine-tuning on small detection datasets: the authors report 81.1% mAP on PASCAL-VOC at 0.7 IoU after pre-training on OpenImagesV4, which is 7.6% better than DeformableConvNetsV2 with ImageNet pre-training. Detection pre-training also benefits other localization tasks such as semantic segmentation, but it harms image classification.

We provide a detailed analysis of convolutional neural networks which are pre-trained on the task of object detection. To this end, we train detectors on large datasets like OpenImagesV4, ImageNet Localization and COCO. We analyze how well their features generalize to tasks like image classification, semantic segmentation and object detection on small datasets like PASCAL-VOC, Caltech-256, SUN-397, Flowers-102 etc. Some important conclusions from our analysis are: 1) Pre-training on large detection datasets is crucial for fine-tuning on small detection datasets, especially when precise localization is needed. For example, we obtain 81.1% mAP on the PASCAL-VOC dataset at 0.7 IoU after pre-training on OpenImagesV4, which is 7.6% better than the recently proposed DeformableConvNetsV2 which uses ImageNet pre-training. 2) Detection pre-training also benefits other localization tasks like semantic segmentation but adversely affects image classification. 3) Features for images (like avg. pooled Conv5) which are similar in the object detection feature space are likely to be similar in the image classification feature space but the converse is not true. 4) Visualization of features reveals that detection neurons have activations over an entire object, while activations for classification networks typically focus on parts. Therefore, detection networks are poor at classification when multiple instances are present in an image or when an instance only covers a small fraction of an image.
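
The "0.7 IoU" in the abstract refers to the overlap a predicted box must reach with a ground-truth box to count as correct. The sketch below is illustrative only and is not taken from the paper: the function names `box_iou` and `is_true_positive` and the `(x1, y1, x2, y2)` box format are assumptions. It shows why a stricter threshold rewards precise localization: a box shifted by 10 pixels passes at 0.5 IoU but fails at 0.7.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2).

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    # Corners of the overlapping rectangle, if any.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, gt, thresh=0.7):
    """A prediction matches a ground-truth box if IoU reaches the threshold."""
    return box_iou(pred, gt) >= thresh


gt = (0, 0, 100, 100)
pred = (10, 10, 110, 110)  # same size, shifted by 10 pixels in x and y

print(round(box_iou(pred, gt), 3))         # 8100 / 11900
print(is_true_positive(pred, gt, 0.5))     # accepted at the looser threshold
print(is_true_positive(pred, gt, 0.7))     # rejected at the stricter threshold
```

Full mAP evaluation additionally ranks detections by confidence and averages precision per class; this sketch covers only the per-box matching criterion.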
