Beyond Semantic Image Segmentation: Exploring Efficient Inference in Video
This work addresses video semantic segmentation, incrementally extending existing image-based methods.
The paper tackles the problem of extending semantic segmentation from images to video by adapting CRF inference methods to handle video data efficiently, achieving inference over ten thousand images within seconds.
We explore the efficiency of the CRF inference module beyond image-level semantic segmentation. The key idea is to combine the best of two worlds: semantic co-labeling and more expressive models. Similar to [Alvarez14], our formulation enables us to perform inference over ten thousand images within seconds. At the same time, it can handle higher-order clique potentials similar to [vineet2014], in the form of region-level label consistency and co-occurrence context. We follow the mean-field updates for higher-order potentials of [vineet2014] and, inspired by [Alvarez14], extend the spatial smoothness and appearance kernels of [DenseCRF13] to video data, making the system amenable to efficient video semantic segmentation.
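To make the kernel extension concrete, the following is a minimal sketch of mean-field inference with pairwise Gaussian kernels over a small video volume. It is an illustration under stated assumptions, not the formulation of this work: the function names (`video_features`, `gaussian_kernel`, `mean_field`), the bandwidths `theta_pos`, `theta_time`, `theta_rgb`, and the weights are hypothetical; appending a time coordinate to the position features is one plausible way to extend the smoothness and appearance kernels to video; the compatibility function is assumed to be Potts; the higher-order and co-occurrence terms are omitted; and the kernel is evaluated by brute force in O(N^2), which is only feasible for tiny inputs and is not the efficient filtering used in practice.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def video_features(video, theta_pos=3.0, theta_time=2.0, theta_rgb=0.2):
    """Build per-pixel features for a (T, H, W, 3) video.

    Assumption: the smoothness kernel uses scaled (x, y, t) and the
    appearance kernel uses scaled (x, y, t, r, g, b).
    """
    T, H, W, _ = video.shape
    t, y, x = np.meshgrid(np.arange(T), np.arange(H), np.arange(W), indexing="ij")
    pos = np.stack([x / theta_pos, y / theta_pos, t / theta_time], axis=-1)
    pos = pos.reshape(-1, 3).astype(float)
    rgb = video.reshape(-1, 3) / theta_rgb
    return pos, np.concatenate([pos, rgb], axis=1)

def gaussian_kernel(f):
    """Dense N x N Gaussian kernel with a zero diagonal (brute force)."""
    sq_dist = ((f[:, None, :] - f[None, :, :]) ** 2).sum(axis=-1)
    k = np.exp(-0.5 * sq_dist)
    np.fill_diagonal(k, 0.0)  # a pixel sends no message to itself
    return k

def mean_field(unary, video, w_smooth=1.0, w_app=2.0, n_iters=5):
    """Approximate per-pixel label marginals.

    unary: (T, H, W, L) array of label costs (lower is better).
    video: (T, H, W, 3) array of colours in [0, 1].
    """
    T, H, W, L = unary.shape
    pos, app = video_features(video)
    K = w_smooth * gaussian_kernel(pos) + w_app * gaussian_kernel(app)
    U = unary.reshape(-1, L)
    Q = softmax(-U)  # initialise from the unary costs
    for _ in range(n_iters):
        msg = K @ Q  # message passing: sum_j k(f_i, f_j) Q_j(l)
        # Potts compatibility: label l is penalised by the mass on all other labels.
        pairwise = msg.sum(axis=1, keepdims=True) - msg
        Q = softmax(-U - pairwise)
    return Q.reshape(T, H, W, L)
```

Calling `mean_field(unary, video)` on a two-frame, 4 x 4 clip returns an array of shape `(2, 4, 4, L)` whose last axis sums to one at every pixel. Because the time coordinate enters the kernel like a spatial one, labels are smoothed across frames as well as within them.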