Spatially Supervised Recurrent Convolutional Neural Networks for Visual Object Tracking
This work addresses the problem of accurate and efficient object tracking in videos for computer vision applications, representing an incremental improvement over existing deep learning methods.
The paper tackles visual object tracking by introducing a spatially supervised recurrent convolutional neural network that uses regression for direct location prediction, achieving state-of-the-art accuracy and robustness with low computational cost on benchmark datasets.
In this paper, we develop a new approach to visual object tracking based on spatially supervised recurrent convolutional neural networks. Our recurrent convolutional network exploits the history of locations as well as the distinctive visual features learned by deep neural networks. Inspired by recent bounding box regression methods for object detection, we study the regression capability of Long Short-Term Memory (LSTM) in the temporal domain, and propose to concatenate high-level visual features produced by convolutional networks with region information. In contrast to existing deep-learning-based trackers that use binary classification for region candidates, we use regression for direct prediction of the tracking locations both at the convolutional layer and at the recurrent unit. Our extensive experimental results and performance comparison with state-of-the-art tracking methods on challenging benchmark video tracking datasets show that our tracker is more accurate and robust while maintaining low computational cost. For most test video sequences, our method achieves the best tracking performance, often outperforming the second best by a large margin.
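The concatenate-then-regress idea can be illustrated with a minimal sketch. It is not the network described in this paper. It assumes a per-frame visual feature vector and a preliminary box (x, y, w, h) are already available from a convolutional stage, and it uses a toy, untrained LSTM cell written in plain Python. The class `LSTMRegressor`, its methods, the dimensions, and the mean-squared-error loss are all hypothetical choices made for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class LSTMRegressor:
    """Toy LSTM that regresses a box from [visual features ; region box].

    Hypothetical illustration only: weights are random and untrained.
    """

    def __init__(self, feat_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        # Input to every gate: features + 4 box values + previous hidden state.
        in_dim = feat_dim + 4 + hidden_dim
        self.hidden_dim = hidden_dim
        # One weight matrix per gate: input (i), forget (f), output (o), candidate (g).
        self.W = {k: rand_matrix(hidden_dim, in_dim, rng) for k in "ifog"}
        self.b = {k: [0.0] * hidden_dim for k in "ifog"}
        # Linear read-out from the hidden state to 4 box coordinates.
        self.W_out = rand_matrix(4, hidden_dim, rng)

    def step(self, features, region, state):
        h, c = state
        # Concatenate visual features with region information (and recurrent state).
        z = list(features) + list(region) + h
        pre = {k: [a + b for a, b in zip(matvec(self.W[k], z), self.b[k])]
               for k in "ifog"}
        i = [sigmoid(v) for v in pre["i"]]
        f = [sigmoid(v) for v in pre["f"]]
        o = [sigmoid(v) for v in pre["o"]]
        g = [math.tanh(v) for v in pre["g"]]
        c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
        h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
        # Direct regression of the location: no candidate scoring or classification.
        box = matvec(self.W_out, h)
        return box, (h, c)

    def track(self, feature_seq, region_seq):
        """Run over a sequence of frames and return one regressed box per frame."""
        state = ([0.0] * self.hidden_dim, [0.0] * self.hidden_dim)
        boxes = []
        for features, region in zip(feature_seq, region_seq):
            box, state = self.step(features, region, state)
            boxes.append(box)
        return boxes


def mse(pred, target):
    """Mean squared error, a common regression loss (assumed here)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# Usage with made-up inputs: 3 frames, 8-dimensional features, normalized boxes.
tracker = LSTMRegressor(feat_dim=8, hidden_dim=16, seed=0)
frames = [[0.1 * k] * 8 for k in range(3)]
regions = [[0.5, 0.5, 0.2, 0.3]] * 3
predicted = tracker.track(frames, regions)
print(mse(predicted[-1], [0.5, 0.5, 0.2, 0.3]))
```

Because the hidden and cell states carry over between frames, each regressed box depends on the history of earlier features and locations, not only on the current frame. A real tracker would train the weights by minimizing a regression loss against ground-truth boxes.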