BoxMask: Revisiting Bounding Box Supervision for Video Object Detection
This work addresses a specific limitation in video object detection for applications requiring fine-grained object discrimination, representing an incremental improvement over existing methods.
The paper tackles confusion among objects with similar appearance or motion in video object detection. It proposes BoxMask, a method that treats bounding box annotations as coarse masks to incorporate pixel-level information, and reports consistent and significant improvements on the ImageNet VID and EPIC KITCHENS datasets.
We present a new, simple yet effective approach to improve video object detection. We observe that prior works operate on instance-level feature aggregation, which inherently neglects the refined pixel-level representation and results in confusion among objects sharing similar appearance or motion characteristics. To address this limitation, we propose BoxMask, which effectively learns discriminative representations by incorporating class-aware pixel-level information. We simply consider bounding box-level annotations as a coarse mask for each object to supervise our method. The proposed module can be effortlessly integrated into any region-based detector to boost detection. Extensive experiments on the ImageNet VID and EPIC KITCHENS datasets demonstrate consistent and significant improvements when we plug our BoxMask module into numerous recent state-of-the-art methods.
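To make the idea of treating box annotations as a coarse mask concrete, the following is a minimal sketch of how boxes and class labels could be rasterised into a class-aware pixel map. It is an illustration under stated assumptions, not the implementation used in this work: the function name `boxes_to_coarse_mask`, the `(x1, y1, x2, y2)` pixel box format, the use of index 0 for background, and the rule that smaller boxes overwrite larger ones where they overlap are all hypothetical choices made for this example.

```python
import numpy as np


def boxes_to_coarse_mask(boxes, labels, height, width, background=0):
    """Rasterise box annotations into a (height, width) map of class indices.

    boxes  : sequence of (x1, y1, x2, y2) in pixel coordinates (assumed format)
    labels : sequence of positive integer class indices, one per box
    Pixels not covered by any box keep the `background` index.
    """
    mask = np.full((height, width), background, dtype=np.int64)

    # Assumption: paint larger boxes first so that smaller boxes stay visible
    # where annotations overlap. The source does not specify an overlap rule.
    areas = [(x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in boxes]
    for i in np.argsort(areas)[::-1]:
        x1, y1, x2, y2 = boxes[i]
        # Clip to the image and snap to the enclosing integer pixel grid.
        x1 = max(0, int(np.floor(x1)))
        y1 = max(0, int(np.floor(y1)))
        x2 = min(width, int(np.ceil(x2)))
        y2 = min(height, int(np.ceil(y2)))
        mask[y1:y2, x1:x2] = labels[i]
    return mask


# Two overlapping objects of different classes on a 6x6 frame.
example = boxes_to_coarse_mask(
    boxes=[(1, 1, 5, 5), (2, 2, 4, 4)], labels=[1, 2], height=6, width=6
)
```

A map of this kind could serve as a per-pixel classification target for an auxiliary head attached to a region-based detector. Because it is derived from the box annotations alone, it requires no segmentation labels.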