CV · Oct 12, 2022

BoxMask: Revisiting Bounding Box Supervision for Video Object Detection

Khurram Azeem Hashmi, Alain Pagani, Didier Stricker, Muhammad Zeshan Afzal

arXiv:2210.06008v17.314 citations · h-index: 50

Originality: Incremental advance

AI Analysis

This work addresses a specific limitation in video object detection for applications requiring fine-grained object discrimination, representing an incremental improvement over existing methods.

The paper tackles confusion among objects with similar appearance or motion in video object detection. It proposes BoxMask, a method that treats bounding box annotations as coarse masks to incorporate pixel-level information, and reports consistent and significant improvements on the ImageNet VID and EPIC KITCHENS datasets.

We present a new, simple yet effective approach to uplift video object detection. We observe that prior works operate on instance-level feature aggregation that inherently neglects the refined pixel-level representation, resulting in confusion among objects sharing similar appearance or motion characteristics. To address this limitation, we propose BoxMask, which effectively learns discriminative representations by incorporating class-aware pixel-level information. We simply consider bounding box-level annotations as a coarse mask for each object to supervise our method. The proposed module can be effortlessly integrated into any region-based detector to boost detection. Extensive experiments on the ImageNet VID and EPIC KITCHENS datasets demonstrate consistent and significant improvement when we plug our BoxMask module into numerous recent state-of-the-art methods.
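
The idea of treating box annotations as a coarse mask can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `boxes_to_coarse_mask`, the `(x_min, y_min, x_max, y_max)` box format with exclusive maxima, the background label `0`, and the rule that smaller boxes are painted over larger ones where they overlap are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels, max-exclusive.
Box = Tuple[int, int, int, int]


def box_area(box: Box) -> int:
    """Area of a box; degenerate boxes count as zero."""
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)


def boxes_to_coarse_mask(
    boxes: Sequence[Box],
    labels: Sequence[int],
    height: int,
    width: int,
    background: int = 0,
) -> List[List[int]]:
    """Rasterize labeled boxes into a class-aware mask of shape height x width."""
    if len(boxes) != len(labels):
        raise ValueError("boxes and labels must have the same length")

    mask = [[background] * width for _ in range(height)]

    # Assumption: paint large boxes first so smaller ones stay visible on overlap.
    order = sorted(range(len(boxes)), key=lambda i: -box_area(boxes[i]))
    for i in order:
        x0, y0, x1, y1 = boxes[i]
        # Clip the box to the image bounds.
        x0, y0 = max(0, x0), max(0, y0)
        x1, y1 = min(width, x1), min(height, y1)
        for y in range(y0, y1):
            for x in range(x0, x1):
                mask[y][x] = labels[i]
    return mask


if __name__ == "__main__":
    demo = boxes_to_coarse_mask([(0, 0, 4, 4), (1, 1, 3, 3)], [1, 2], height=4, width=5)
    for row in demo:
        print(row)
```

Every pixel inside a box receives that box's class label, so the resulting mask is coarse: background pixels that fall inside a box are labeled as the object. A mask of this kind could serve as a pixel-level supervision target without any segmentation annotations.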
