Residual Bi-Fusion Feature Pyramid Network for Accurate Single-shot Object Detection
This work addresses a key bottleneck in object detection for computer vision applications, offering an incremental improvement over existing feature pyramid methods.
The paper tackles the degradation of object detection accuracy in deeper feature pyramids, which is caused by positional shifts from pooling. It proposes a residual bi-directional feature pyramid that fuses deep and shallow features to improve detection of both small and large objects, achieving state-of-the-art results on the VOC and COCO datasets.
State-of-the-art (SoTA) models have improved object detection accuracy by a large margin via a feature pyramid (FP). An FP is a top-down aggregation that collects semantically strong features to improve scale invariance in both two-stage and one-stage detectors. However, this top-down pathway cannot preserve accurate object positions because of the shift effect of pooling. Thus, the advantage of the FP in improving detection accuracy disappears when more layers are used. The original FP lacks a bottom-up pathway to offset the information lost from lower-layer feature maps. It performs well in detecting large objects but poorly in detecting small ones. This paper proposes a new structure, the "residual feature pyramid". It is bidirectional, fusing both deep and shallow features for more effective and robust detection of both small and large objects. Owing to its "residual" nature, it can be trained and integrated into different backbones (even deeper or lighter ones) more easily than other bi-directional methods. One important property of this residual FP is that accuracy still improves even when more layers are adopted. Extensive experiments on the VOC and MS COCO datasets show that the proposed method achieves SoTA results for highly accurate and efficient object detection.
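To make the idea of bidirectional fusion with a residual connection concrete, the following is a minimal sketch in plain Python. It is not the paper's module: the abstract does not specify the fusion operators, so the nearest-neighbour upsampling, 2x2 max pooling, element-wise addition, and the function names `upsample2x`, `downsample2x`, and `residual_bifusion` are all assumptions chosen for illustration. Feature maps are single-channel nested lists, ordered from shallow (large) to deep (small), with each level half the size of the previous one.

```python
# Illustrative sketch only. The fusion rule here is an assumption,
# not the module described in the paper.

def upsample2x(fm):
    """Nearest-neighbour 2x upsampling of a 2D map (deep -> shallow)."""
    out = []
    for row in fm:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def downsample2x(fm):
    """2x2 max pooling with stride 2 (shallow -> deep)."""
    return [
        [max(fm[i][j], fm[i][j + 1], fm[i + 1][j], fm[i + 1][j + 1])
         for j in range(0, len(fm[0]), 2)]
        for i in range(0, len(fm), 2)
    ]


def add(a, b):
    """Element-wise sum of two maps of equal size."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def residual_bifusion(levels):
    """Fuse each level with its deeper and shallower neighbours.

    The level's own features pass through unchanged (identity path);
    the neighbours only contribute an additive correction.
    """
    fused = []
    for i, base in enumerate(levels):
        correction = [[0.0] * len(base[0]) for _ in base]
        if i + 1 < len(levels):   # top-down: semantic context from deeper level
            correction = add(correction, upsample2x(levels[i + 1]))
        if i > 0:                 # bottom-up: positional detail from shallower level
            correction = add(correction, downsample2x(levels[i - 1]))
        fused.append(add(base, correction))  # residual: base + correction
    return fused


if __name__ == "__main__":
    pyramid = [
        [[1.0] * 4 for _ in range(4)],  # shallow, 4x4
        [[2.0] * 2 for _ in range(2)],  # middle, 2x2
        [[3.0]],                        # deep, 1x1
    ]
    for level in residual_bifusion(pyramid):
        print(len(level), "x", len(level[0]), level[0])
```

Because each output is the original level plus a correction, a level's own features reach the output untouched even if the neighbouring contributions are small. This is one plausible reading of why a residual formulation could remain trainable as more levels are added, though the abstract itself does not spell out the mechanism.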