Rethinking Temporal Object Detection from Robotic Perspectives
This work addresses the need for better temporal evaluation metrics in video object detection for robotics applications, offering incremental improvements to existing methods.
The paper evaluates video object detection from a robotic perspective: it proposes non-reference assessments of recall continuity and localization stability to supplement static accuracy metrics such as AP. It also develops an online tracklet refinement method that improves temporal performance, validated on ImageNet VID and real-world robotic tasks.
Video object detection (VID) has been vigorously studied for years, but almost all of the literature adopts a static accuracy-based evaluation, i.e., average precision (AP). From a robotic perspective, recall continuity and localization stability are as important as accuracy, yet AP is insufficient to reflect a detector's performance across time. In this paper, we propose non-reference assessments of continuity and stability based on object tracklets. These temporal evaluations can serve as supplements to static AP. Further, we develop an online tracklet refinement that improves detectors' temporal performance through short tracklet suppression, fragment filling, and temporal location fusion. In addition, we propose a small-overlap suppression that extends VID methods to the single object tracking (SOT) task, forming a flexible SOT-by-detection framework. Extensive experiments are conducted on the ImageNet VID dataset and on real-world robotic tasks, where the superiority of the proposed approaches is validated. Code will be made publicly available.
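The three refinement steps named above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract gives no algorithmic detail, so the function names, the parameters `min_len`, `max_gap`, and `alpha`, the linear interpolation, and the exponential smoothing are all assumptions chosen for illustration. A tracklet is modeled as one optional box per frame, with `None` marking a missed detection. Note that the gap filling here looks at the next detected frame, so it is an offline simplification of what an online method would do.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, ...]            # (x1, y1, x2, y2); hypothetical layout
Track = List[Optional[Box]]        # one entry per frame, None = no detection


def suppress_short_runs(track: Track, min_len: int) -> Track:
    """Drop runs of consecutive detections shorter than min_len frames."""
    out = list(track)
    start = None  # first frame index of the current run, if any
    for i in range(len(track) + 1):
        present = i < len(track) and track[i] is not None
        if present and start is None:
            start = i
        elif not present and start is not None:
            if i - start < min_len:
                for j in range(start, i):
                    out[j] = None
            start = None
    return out


def fill_fragments(track: Track, max_gap: int) -> Track:
    """Linearly interpolate boxes across gaps of at most max_gap frames."""
    out = list(track)
    last = None  # index of the most recent detected frame
    for i, box in enumerate(track):
        if box is None:
            continue
        if last is not None and 1 < i - last <= max_gap + 1:
            for j in range(last + 1, i):
                t = (j - last) / (i - last)
                out[j] = tuple(a + t * (b - a) for a, b in zip(track[last], box))
        last = i
    return out


def fuse_locations(track: Track, alpha: float) -> Track:
    """Causally smooth box coordinates with an exponential moving average."""
    out: Track = []
    prev = None  # previous fused box; reset whenever the track is interrupted
    for box in track:
        if box is None:
            out.append(None)
            prev = None
            continue
        if prev is None:
            fused = box
        else:
            fused = tuple(alpha * b + (1 - alpha) * p for b, p in zip(box, prev))
        out.append(fused)
        prev = fused
    return out
```

In this sketch, suppression targets recall flicker from spurious detections, filling targets continuity, and fusion targets localization jitter; the order in which the steps are chained and the threshold values are left to the caller.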