Learning a Fast 3D Spectral Approach to Object Segmentation and Tracking over Space and Time
This work addresses consistent object segmentation and tracking in video for computer vision applications, and proposes a new method rather than an incremental improvement on existing ones.
The authors pose video object segmentation and tracking as spectral graph clustering in space and time. They report state-of-the-art results on multiple benchmarks, with a GPU implementation that is orders of magnitude faster than classical approaches for computing the eigenvector.
We pose video object segmentation as spectral graph clustering in space and time, with one graph node for each pixel and edges forming local space-time neighborhoods. We claim that the strongest cluster in this video graph represents the salient object. We start by introducing a novel and efficient method based on 3D filtering for approximating the spectral solution, namely the principal eigenvector of the graph's adjacency matrix, without explicitly building the matrix. This key property allows a fast, parallel GPU implementation that is orders of magnitude faster than classical approaches for computing the eigenvector. Our motivation for a spectral space-time clustering approach, unique in the video semantic segmentation literature, is that such clustering is dedicated to preserving object consistency over time, which we evaluate using our novel segmentation consistency measure. We then show how to efficiently learn the solution over multiple input feature channels. Finally, we extend the formulation of our approach beyond the segmentation task, into the realm of object tracking. In extensive experiments we show significant improvements over top methods, as well as over powerful ensembles that combine them, achieving state-of-the-art results on multiple benchmarks for both tracking and segmentation.
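The idea of computing the principal eigenvector without building the adjacency matrix can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a simplified adjacency of the form A[i, j] = s[i] * s[j] for pixels i and j within a small space-time window (including i itself) and 0 otherwise, where s is a hypothetical per-pixel saliency volume of shape (T, H, W). Under that assumption, the product A x reduces to a 3D box filter applied to s * x, so power iteration needs only filtering and element-wise operations. The function names and the parameters radius and iters are illustrative choices.

```python
import numpy as np


def neighborhood_sum(x, radius=1):
    """Sum x over a (2r+1)^3 space-time window with zero padding (3D box filter)."""
    r = radius
    t, h, w = x.shape
    padded = np.pad(x, r, mode="constant")
    out = np.zeros_like(x)
    # Accumulate every shifted copy of the volume that falls inside the window.
    for dt in range(2 * r + 1):
        for dy in range(2 * r + 1):
            for dx in range(2 * r + 1):
                out += padded[dt:dt + t, dy:dy + h, dx:dx + w]
    return out


def apply_adjacency(x, s, radius=1):
    """Compute A x for the assumed A[i, j] = s[i] * s[j] on local neighborhoods.

    The matrix A is never materialized: (A x)[i] = s[i] * sum_{j in N(i)} s[j] * x[j].
    """
    return s * neighborhood_sum(s * x, radius)


def principal_eigenvector(s, radius=1, iters=50):
    """Approximate the leading eigenvector of A by power iteration."""
    x = np.ones_like(s, dtype=float)
    x /= np.linalg.norm(x)
    for _ in range(iters):
        x = apply_adjacency(x, s, radius)
        x /= np.linalg.norm(x)  # keep unit L2 norm so values stay bounded
    return x


if __name__ == "__main__":
    # Hypothetical input: a weak background with one brighter space-time blob.
    saliency = np.full((6, 16, 16), 0.1)
    saliency[1:5, 4:10, 4:10] = 1.0
    soft_mask = principal_eigenvector(saliency, radius=1, iters=100)
    print(soft_mask.shape, float(soft_mask.max()))
```

Each iteration costs time linear in the number of pixels, and every output voxel is computed independently, which is the kind of workload that maps naturally onto a GPU. The explicit triple loop over window offsets is kept for clarity; a separable or library-provided 3D filter would serve the same role.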