Clockwork Convnets for Video Semantic Segmentation
This work addresses the need for efficient real-time video semantic segmentation in applications such as autonomous driving and surveillance. Its contribution is incremental in that it builds on existing convnet architectures.
The paper tackles the high computational cost of video semantic segmentation by proposing clockwork convnets, which schedule layer updates according to semantic stability. This reduces computation and latency while maintaining accuracy on the Youtube-Objects, NYUD, and Cityscapes datasets.
Recent years have seen tremendous progress in still-image segmentation; however, the naïve application of these state-of-the-art algorithms to every video frame requires considerable computation and ignores the temporal continuity inherent in video. We propose a video recognition framework that relies on two key observations: 1) while pixels may change rapidly from frame to frame, the semantic content of a scene evolves more slowly, and 2) execution can be viewed as an aspect of architecture, yielding purpose-fit computation schedules for networks. We define a novel family of "clockwork" convnets driven by fixed or adaptive clock signals that schedule the processing of different layers at different update rates according to their semantic stability. We design a pipeline schedule to reduce latency for real-time recognition and a fixed-rate schedule to reduce overall computation. Finally, we extend clockwork scheduling to adaptive video processing by incorporating data-driven clocks that can be tuned on unlabeled video. The accuracy and efficiency of clockwork convnets are evaluated on the Youtube-Objects, NYUD, and Cityscapes video datasets.
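The scheduling idea in the abstract can be illustrated with a small toy model. The sketch below is not the authors' implementation: it assumes each "layer" is a plain Python function on a scalar, that a stage which does not fire reuses its cached output, and that the final prediction is the sum of the cached stage outputs (a stand-in for fusing features from several depths). The names `ClockworkNet`, `fixed_rate`, and `adaptive` are hypothetical, and the thresholded-difference rule in `adaptive` is only one plausible form of a data-driven clock.

```python
from typing import Callable, List, Optional, Sequence

Feature = float  # stand-in for a feature map (assumption: scalars keep the toy small)
Stage = Callable[[Feature], Feature]
# A clock sees the frame index, the stage's current input, and the input the
# stage saw when it last fired; it returns True to recompute the stage.
Clock = Callable[[int, Feature, Feature], bool]


def fixed_rate(period: int) -> Clock:
    """Fire on every `period`-th frame, regardless of the data."""
    return lambda t, x, prev: t % period == 0


def adaptive(threshold: float) -> Clock:
    """Fire only when the input has drifted from its value at the last firing.

    Hypothetical data-driven rule; the threshold is a tunable parameter.
    """
    return lambda t, x, prev: abs(x - prev) > threshold


class ClockworkNet:
    """Toy chain of stages, each gated by its own clock."""

    def __init__(self, stages: Sequence[Stage], clocks: Sequence[Clock]) -> None:
        if len(stages) != len(clocks):
            raise ValueError("need exactly one clock per stage")
        self.stages = list(stages)
        self.clocks = list(clocks)
        self.last_input: List[Optional[Feature]] = [None] * len(stages)
        self.cache: List[Feature] = [0.0] * len(stages)
        self.evaluations = [0] * len(stages)  # how often each stage was recomputed

    def step(self, frame: Feature, t: int) -> Feature:
        x = frame
        for i, (stage, clock) in enumerate(zip(self.stages, self.clocks)):
            prev = self.last_input[i]
            # Always compute on the first frame; afterwards ask the clock.
            if prev is None or clock(t, x, prev):
                self.cache[i] = stage(x)
                self.last_input[i] = x
                self.evaluations[i] += 1
            x = self.cache[i]  # downstream stages see fresh or cached output
        # Fuse the (possibly stale) outputs of all stages into one prediction.
        return sum(self.cache)


# Fixed-rate schedule: the shallow stage runs every frame, the deep stage
# every second frame.
net = ClockworkNet(
    [lambda x: x + 1.0, lambda x: 2.0 * x],
    [fixed_rate(1), fixed_rate(2)],
)
outputs = [net.step(frame, t) for t, frame in enumerate([1.0, 2.0, 3.0, 4.0])]
print(outputs, net.evaluations)  # prints [6.0, 7.0, 12.0, 13.0] [4, 2]
```

In this toy run the deep stage is evaluated half as often as the shallow one, which is the source of the computational saving. Replacing `fixed_rate(2)` with `adaptive(threshold)` makes the deep stage fire only when its input has changed by more than the threshold, so a static scene costs less than a rapidly changing one.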