CV · Oct 11, 2017

Detect to Track and Track to Detect

arXiv:1710.03958v2 · 594 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for simpler, more effective video object detection and tracking, which matters for applications such as autonomous driving and surveillance. Its contribution is an incremental improvement on existing methods rather than a new direction.

The paper tackles the problem of high-accuracy object detection and tracking in video by proposing a ConvNet architecture that jointly performs these tasks, achieving state-of-the-art results on the ImageNet VID dataset with better single-model performance than the previous winning method.

Recent approaches for high accuracy detection and tracking of object categories in video consist of complex multistage solutions that become more cumbersome each year. In this paper we propose a ConvNet architecture that jointly performs detection and tracking, solving the task in a simple and effective way. Our contributions are threefold: (i) we set up a ConvNet architecture for simultaneous detection and tracking, using a multi-task objective for frame-based object detection and across-frame track regression; (ii) we introduce correlation features that represent object co-occurrences across time to aid the ConvNet during tracking; and (iii) we link the frame level detections based on our across-frame tracklets to produce high accuracy detections at the video level. Our ConvNet architecture for spatiotemporal object detection is evaluated on the large-scale ImageNet VID dataset where it achieves state-of-the-art results. Our approach provides better single model performance than the winning method of the last ImageNet challenge while being conceptually much simpler. Finally, we show that by increasing the temporal stride we can dramatically increase the tracker speed.
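Contribution (ii) in the abstract, correlation features across time, can be illustrated with a small sketch. It assumes the correlation at each position is a dot product between the feature vector in frame t and the feature vectors in a small neighbourhood of frame t+τ; the abstract does not spell out this formula, so treat it as an assumption. The names `local_correlation`, `make_map`, and `max_disp` are hypothetical, and plain Python lists stand in for ConvNet feature maps.

```python
from typing import List

# A feature map indexed as [row][col][channel]; a stand-in for ConvNet features.
FeatureMap = List[List[List[float]]]


def make_map(height: int, width: int, channels: int) -> FeatureMap:
    """Create an all-zero feature map (hypothetical helper for the example)."""
    return [[[0.0] * channels for _ in range(width)] for _ in range(height)]


def local_correlation(feat_t: FeatureMap,
                      feat_t_tau: FeatureMap,
                      max_disp: int) -> List[List[List[float]]]:
    """Correlate every position of feat_t with a local window of feat_t_tau.

    out[i][j][k] is the dot product between feat_t[i][j] and
    feat_t_tau[i + p][j + q], where k enumerates the displacements (p, q)
    in row-major order with p, q in [-max_disp, max_disp].
    Displacements that fall outside the map contribute 0.0 (assumed padding).
    """
    height, width = len(feat_t), len(feat_t[0])
    out = []
    for i in range(height):
        row = []
        for j in range(width):
            responses = []
            for p in range(-max_disp, max_disp + 1):
                for q in range(-max_disp, max_disp + 1):
                    ni, nj = i + p, j + q
                    if 0 <= ni < height and 0 <= nj < width:
                        dot = sum(a * b for a, b in
                                  zip(feat_t[i][j], feat_t_tau[ni][nj]))
                    else:
                        dot = 0.0
                    responses.append(dot)
            row.append(responses)
        out.append(row)
    return out


# Toy example: one "object" feature moves one column to the right.
frame_t = make_map(3, 3, 2)
frame_t_tau = make_map(3, 3, 2)
frame_t[1][1] = [1.0, 0.0]
frame_t_tau[1][2] = [1.0, 0.0]

corr = local_correlation(frame_t, frame_t_tau, max_disp=1)
window = 2 * 1 + 1
best_k = max(range(len(corr[1][1])), key=corr[1][1].__getitem__)
# Convert the flat index back to a (row, column) displacement.
best_p, best_q = best_k // window - 1, best_k % window - 1
```

In this toy case the strongest response at position (1, 1) sits at the displacement that matches the object's motion, which is the kind of signal a track-regression head could use. A real implementation would run as a batched layer on GPU tensors; the nested loops here are only for readability.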

Code Implementations: 3 repos
