RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free
This work improves single-shot detectors for embedded vision applications, making them competitive with two-stage methods, though it is incremental as it builds on existing techniques.
The paper tackles the accuracy gap between single-shot and two-stage detectors by enhancing RetinaNet with instance mask prediction, adaptive loss, and hard example training, achieving up to 41.7 mAP on COCO test-dev compared to 39.1 mAP for RetinaNet-101 with no runtime increase.
Recently two-stage detectors have surged ahead of single-shot detectors in the accuracy-vs-speed trade-off. Nevertheless single-shot detectors are immensely popular in embedded vision applications. This paper brings single-shot detectors up to the same level as current two-stage techniques. We do this by improving training for the state-of-the-art single-shot detector, RetinaNet, in three ways: integrating instance mask prediction for the first time, making the loss function adaptive and more stable, and including additional hard examples in training. We call the resulting augmented network RetinaMask. The detection component of RetinaMask has the same computational cost as the original RetinaNet, but is more accurate. COCO test-dev results are up to 41.4 mAP for RetinaMask-101 vs 39.1mAP for RetinaNet-101, while the runtime is the same during evaluation. Adding Group Normalization increases the performance of RetinaMask-101 to 41.7 mAP. Code is at:https://github.com/chengyangfu/retinamask