Triply Supervised Decoder Networks for Joint Detection and Segmentation
This work targets joint detection and segmentation for applications such as self-driving cars, and offers an incremental improvement over existing methods by better exploiting the interaction between the two tasks.
The paper tackles joint object detection and semantic segmentation by proposing TripleNet, a framework that uses triple supervision and light-weight modules so that the two tasks benefit each other. It improves accuracy on the VOC2007 and VOC2012 datasets without extra test-time computational cost.
Joint object detection and semantic segmentation can be applied in many fields, such as self-driving cars and unmanned surface vessels. Initial and important progress towards this goal has been achieved by simply sharing deep convolutional features between the two tasks. However, this simple scheme does not fully exploit the fact that detection and segmentation are mutually beneficial. To overcome this drawback, we propose a framework called TripleNet, in which three types of supervision (detection-oriented supervision, class-aware segmentation supervision, and class-agnostic segmentation supervision) are imposed on each layer of the decoder network. Class-agnostic segmentation supervision provides an objectness prior for both semantic segmentation and object detection. Besides the three types of supervision, two light-weight modules (i.e., an inner-connected module and attention skip-layer fusion) are incorporated into each layer of the decoder. In the proposed framework, detection and segmentation can sufficiently boost each other. Moreover, class-agnostic and class-aware segmentation on each decoder layer are not performed at the test stage, so no extra computational cost is introduced at test time. Experimental results on the VOC2007 and VOC2012 datasets demonstrate that the proposed TripleNet improves both detection and segmentation accuracy without adding extra computational cost.
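To make the idea of imposing three supervisions on every decoder layer concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that the training objective is a weighted sum over decoder layers of a detection loss, a class-aware segmentation loss, and a class-agnostic segmentation loss, and that class-agnostic labels can be derived from class-aware labels by treating label 0 as background. The function names, the dictionary layout, and the weights `w_det`, `w_aware`, and `w_agnostic` are all hypothetical, and the detection loss is treated as an already computed scalar.

```python
import math


def softmax_cross_entropy(logits, target):
    """Cross-entropy for one pixel: `logits` is a list of class scores,
    `target` is the index of the correct class."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def mean_pixel_loss(pixel_logits, labels):
    """Average cross-entropy over all pixels of one decoder layer."""
    losses = [softmax_cross_entropy(l, t) for l, t in zip(pixel_logits, labels)]
    return sum(losses) / len(losses)


def to_class_agnostic(labels):
    """Collapse class-aware labels to object (1) vs. background (0).
    Assumption: label 0 denotes background."""
    return [1 if c > 0 else 0 for c in labels]


def triple_supervised_loss(layers, w_det=1.0, w_aware=1.0, w_agnostic=1.0):
    """Sum the three supervision terms over every decoder layer.

    Each entry of `layers` is a dict (hypothetical layout) with:
      det_loss        - scalar detection loss of that layer
      labels          - class-aware labels at that layer's resolution
      aware_logits    - per-pixel scores over all classes
      agnostic_logits - per-pixel scores over {background, object}
    """
    total = 0.0
    for layer in layers:
        agnostic_labels = to_class_agnostic(layer["labels"])
        total += w_det * layer["det_loss"]
        total += w_aware * mean_pixel_loss(layer["aware_logits"], layer["labels"])
        total += w_agnostic * mean_pixel_loss(layer["agnostic_logits"], agnostic_labels)
    return total


# Toy usage: one decoder layer with 4 pixels, 3 classes, uninformative scores.
toy_layer = {
    "det_loss": 0.5,
    "labels": [0, 2, 1, 0],
    "aware_logits": [[0.0, 0.0, 0.0] for _ in range(4)],
    "agnostic_logits": [[0.0, 0.0] for _ in range(4)],
}
print(triple_supervised_loss([toy_layer]))
```

In this sketch the two segmentation terms only contribute to the training loss. At test time only the detection outputs of each decoder layer would be evaluated, which is consistent with the abstract's statement that the per-layer segmentation branches add no test-time cost.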