Contrastive Learning through Auxiliary Branch for Video Object Detection
This work addresses the problem of detecting objects in degraded video frames for computer vision applications, offering an incremental improvement over prior methods by avoiding extra post-processing.
The paper tackles video object detection by introducing a contrastive learning method that improves robustness to image degradation without increasing computational load during inference, achieving state-of-the-art performance among CNN-based models on the ImageNet VID dataset: 84.0% mAP with ResNet-101 and 85.2% mAP with ResNeXt-101.
Video object detection is a challenging task because videos often suffer from image deterioration such as motion blur, occlusion, and deformable shapes, making it significantly more difficult than detecting objects in still images. Prior approaches have improved video object detection performance by employing feature aggregation and complex post-processing techniques, though at the cost of increased computational demands. To improve robustness to image degradation without additional computational load during inference, we introduce a straightforward yet effective Contrastive Learning through Auxiliary Branch (CLAB) method. First, we implement a contrastive auxiliary branch trained with a contrastive loss to enhance the feature representation capability of the video object detector's backbone. Next, we propose a dynamic loss weighting strategy that emphasizes auxiliary feature learning early in training and gradually prioritizes the detection task as training converges. We validate our approach through comprehensive experiments and ablation studies, demonstrating consistent performance gains. Without bells and whistles, CLAB reaches 84.0% mAP and 85.2% mAP with ResNet-101 and ResNeXt-101, respectively, on the ImageNet VID dataset, thus achieving state-of-the-art performance for CNN-based models without requiring additional post-processing methods.
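The two components described above, a contrastive loss on an auxiliary branch and a loss weight that shifts from the auxiliary task to detection over training, can be illustrated with a minimal sketch. The abstract does not specify the exact loss or schedule, so everything here is an assumption: the loss is written in a generic InfoNCE style with cosine similarity, the schedule is a simple linear decay, and the names `info_nce`, `aux_weight`, `total_loss`, `temperature`, `w_start`, and `w_end` are hypothetical and not taken from the paper.

```python
import math


def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss for one query vector.

    Assumed form (not taken from the paper): cosine similarity scaled by
    a temperature, with the positive pair treated as the correct class.
    All vectors must be non-zero and of equal length.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    # Index 0 holds the positive pair; the rest are negatives.
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]

    # Numerically stable log-sum-exp over all logits.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[0]


def aux_weight(step, total_steps, w_start=1.0, w_end=0.0):
    """Hypothetical linear schedule: the auxiliary weight starts at
    w_start and decays to w_end as training progresses."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return w_start + (w_end - w_start) * progress


def total_loss(det_loss, con_loss, step, total_steps):
    """Detection loss plus the dynamically weighted contrastive loss."""
    return det_loss + aux_weight(step, total_steps) * con_loss
```

Under these assumptions the auxiliary branch contributes only to the training objective. At inference the branch and its loss are simply not evaluated, which is consistent with the claim of no additional computational load at test time.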