Frustratingly Simple but Effective Zero-shot Detection and Segmentation: Analysis and a Strong Baseline
The paper addresses the cost of annotating data for object detection and segmentation by enabling recognition of unseen categories, though its contribution is incremental because it builds on existing zero-shot frameworks.
The paper tackles zero-shot object detection and segmentation by analyzing design choices and proposing a simple method that outperforms more complex architectures on the MSCOCO dataset.
Methods for object detection and segmentation often require abundant instance-level annotations for training, which are time-consuming and expensive to collect. To address this, the task of zero-shot object detection (or segmentation) aims at learning effective methods for identifying and localizing object instances for the categories that have no supervision available. Constructing architectures for these tasks requires choosing from a myriad of design options, ranging from the form of the class encoding used to transfer information from seen to unseen categories, to the nature of the function being optimized for learning. In this work, we extensively study these design choices, and carefully construct a simple yet extremely effective zero-shot recognition method. Through extensive experiments on the MSCOCO dataset on object detection and segmentation, we highlight that our proposed method outperforms existing, considerably more complex, architectures. Our findings and method, which we propose as a competitive future baseline, point towards the need to revisit some of the recent design trends in zero-shot detection / segmentation.
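As a rough illustration of how a class encoding can transfer information from seen to unseen categories, the sketch below scores region features against per-class embedding vectors by cosine similarity, so that unseen categories are recognized simply by supplying their embeddings at test time. This is not the paper's method: the function names, the softmax scoring, the temperature value, and the random stand-in embeddings and features are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def classify_regions(region_features, class_embeddings, temperature=0.1):
    """Return per-region class probabilities from cosine similarities.

    region_features:  (num_regions, dim) array, hypothetical detector outputs.
    class_embeddings: (num_classes, dim) array, hypothetical class encodings.
    temperature:      assumed scaling factor, not a value from the paper.
    """
    features = l2_normalize(region_features)
    embeddings = l2_normalize(class_embeddings)
    logits = features @ embeddings.T / temperature
    # Subtract the row maximum for a numerically stable softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
dim = 8

# Random stand-ins for class encodings (assumption: real systems would use
# word vectors or similar semantic embeddings).
seen_embeddings = rng.normal(size=(3, dim))
unseen_embeddings = rng.normal(size=(2, dim))

# Hypothetical region features lying close to unseen classes 1 and 0.
region_features = unseen_embeddings[[1, 0]] + 0.01 * rng.normal(size=(2, dim))

# At test time the seen-class embeddings are swapped for the unseen ones;
# the scoring function itself is unchanged.
probs = classify_regions(region_features, unseen_embeddings)
predicted = probs.argmax(axis=1)
```

The point of the design is that the scoring function never refers to a fixed label set: training would use `seen_embeddings`, and evaluation replaces them with `unseen_embeddings` without changing any other component.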