Improving Object Detection with Selective Self-supervised Self-training
This work improves object detection for everyday scenes by leveraging diverse Web data, an incremental advance in self-supervised self-training methods.
The paper augments human-curated object detection datasets with Web images. It uses image-to-image search to reduce domain shift and a selective net to rectify the supervision signals, and it reports state-of-the-art results on detecting backpacks, chairs, and other challenging classes.
We study how to leverage Web images to augment human-curated object detection datasets. Our approach is two-pronged. On the one hand, we retrieve Web images by image-to-image search, which incurs less domain shift from the curated data than other search methods. The Web images are diverse, supplying a wide variety of object poses, appearances, and interactions with the surrounding context. On the other hand, we propose a novel learning method motivated by two parallel lines of work that explore unlabeled data for image classification: self-training and self-supervised learning. In their vanilla forms, both fail to improve object detectors because of the domain gap between the Web images and the curated datasets. To tackle this challenge, we propose a selective net to rectify the supervision signals in Web images. It not only identifies positive bounding boxes but also creates a safe zone for mining hard negative boxes. We report state-of-the-art results on detecting backpacks and chairs from everyday scenes, along with other challenging object classes.
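The idea of separating positive boxes from a safe zone for negatives can be illustrated with a minimal sketch. This is not the paper's selective net, which is a learned model whose details the abstract does not give. The sketch assumes each candidate box on a Web image already carries a selector score, and it treats a low-scoring box as a safe negative only if it barely overlaps any accepted positive. The function names and the thresholds `pos_thresh`, `neg_thresh`, and `max_iou` are hypothetical choices made for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def partition_boxes(
    boxes: List[Box],
    scores: List[float],
    pos_thresh: float = 0.8,  # hypothetical: accept as positive at or above this
    neg_thresh: float = 0.2,  # hypothetical: negative candidate at or below this
    max_iou: float = 0.3,     # hypothetical: allowed overlap with any positive
) -> List[str]:
    """Label each box 'positive', 'negative', or 'ignore'.

    Assumption (not from the paper): a negative is 'safe' only when its
    score is low AND it does not overlap an accepted positive by much.
    Everything in between is ignored, so it contributes no training signal.
    """
    positives = [i for i, s in enumerate(scores) if s >= pos_thresh]
    labels = []
    for box, score in zip(boxes, scores):
        if score >= pos_thresh:
            labels.append("positive")
        elif score <= neg_thresh and all(
            iou(box, boxes[p]) <= max_iou for p in positives
        ):
            labels.append("negative")
        else:
            labels.append("ignore")
    return labels


example_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60), (20, 20, 30, 30)]
example_scores = [0.9, 0.1, 0.05, 0.5]
example_labels = partition_boxes(example_boxes, example_scores)
```

In this toy input, the second box scores low but overlaps the accepted positive heavily, so it is ignored rather than used as a negative. The point of an ignore band is that ambiguous boxes in noisy Web images would otherwise inject wrong supervision in either direction.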