BSUV-Net: A Fully-Convolutional Neural Network for Background Subtraction of Unseen Videos
This addresses a gap in computer vision for applications like object tracking and people recognition by enabling effective background subtraction on completely unseen videos, though it is an incremental improvement over existing supervised methods.
The paper tackles the problem of background subtraction for unseen videos, where existing supervised methods rely on annotated frames from the test video during training, and proposes BSUV-Net, a fully-convolutional neural network that outperforms state-of-the-art algorithms on the CDNet-2014 dataset in metrics like F-measure, recall, and precision.
Background subtraction is a basic task in computer vision and video processing often applied as a pre-processing step for object tracking, people recognition, etc. Recently, a number of successful background-subtraction algorithms have been proposed, however nearly all of the top-performing ones are supervised. Crucially, their success relies upon the availability of some annotated frames of the test video during training. Consequently, their performance on completely "unseen" videos is undocumented in the literature. In this work, we propose a new, supervised, background-subtraction algorithm for unseen videos (BSUV-Net) based on a fully-convolutional neural network. The input to our network consists of the current frame and two background frames captured at different time scales along with their semantic segmentation maps. In order to reduce the chance of overfitting, we also introduce a new data-augmentation technique which mitigates the impact of illumination difference between the background frames and the current frame. On the CDNet-2014 dataset, BSUV-Net outperforms state-of-the-art algorithms evaluated on unseen videos in terms of several metrics including F-measure, recall and precision.