CV · Jan 5, 2021

Local Memory Attention for Fast Video Semantic Segmentation

arXiv:2101.01715v2 · 40 citations
AI Analysis

This work gives researchers and practitioners a fast, general module for adapting existing single-frame semantic segmentation models to video, improving segmentation accuracy at a small cost in inference time.

This paper introduces a neural network module that converts single-frame semantic segmentation models into video pipelines. It improves mIoU on Cityscapes by 1.7% for ERFNet and 2.1% for PSPNet, while adding only 1.5 ms to ERFNet's inference time.

We propose a novel neural network module that transforms an existing single-frame semantic segmentation model into a video semantic segmentation pipeline. In contrast to prior works, we strive towards a simple, fast, and general module that can be integrated into virtually any single-frame architecture. Our approach aggregates a rich representation of the semantic information in past frames into a memory module. Information stored in the memory is then accessed through an attention mechanism. In contrast to previous memory-based approaches, we propose a fast local attention layer, providing temporal appearance cues in the local region of prior frames. We further fuse these cues with an encoding of the current frame through a second attention-based module. The segmentation decoder processes the fused representation to predict the final semantic segmentation. We integrate our approach into two popular semantic segmentation networks: ERFNet and PSPNet. We observe an improvement in segmentation performance on Cityscapes by 1.7% and 2.1% in mIoU respectively, while increasing inference time of ERFNet by only 1.5ms.
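The local attention read described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function name `local_memory_attention`, the nested-list tensor layout, the `radius` parameter, the scaled dot-product similarity, and the joint softmax over all memory frames are all assumptions made for illustration. Each position in the current frame attends only to a small window around the same position in each stored frame, which keeps the cost far below that of attending to every memory location.

```python
import math


def local_memory_attention(query, memory, radius=1):
    """Read from a memory of past frames using local attention.

    query:  H x W x C nested lists, keys of the current frame.
    memory: list of (keys, values) pairs, one per past frame, where
            keys is H x W x C and values is H x W x D.
    radius: half-size of the square search window (hypothetical parameter).
    Returns an H x W x D nested list of attended memory values.
    """
    height, width = len(query), len(query[0])
    channels = len(query[0][0])
    scale = 1.0 / math.sqrt(channels)  # assumed scaled dot-product similarity

    output = []
    for i in range(height):
        row = []
        for j in range(width):
            scores, candidates = [], []
            for keys, values in memory:
                # Only look at the local window, clipped at the image border.
                for y in range(max(0, i - radius), min(height, i + radius + 1)):
                    for x in range(max(0, j - radius), min(width, j + radius + 1)):
                        dot = sum(q * k for q, k in zip(query[i][j], keys[y][x]))
                        scores.append(dot * scale)
                        candidates.append(values[y][x])

            # Numerically stable softmax over all candidates (assumed to be
            # normalized jointly across frames and window positions).
            peak = max(scores)
            weights = [math.exp(s - peak) for s in scores]
            total = sum(weights)

            dim = len(candidates[0])
            row.append([
                sum(w * v[d] for w, v in zip(weights, candidates)) / total
                for d in range(dim)
            ])
        output.append(row)
    return output


# Toy usage: a 2 x 3 feature map with 2-channel keys and 1-channel values.
keys = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
        [[0.5, 0.5], [2.0, 0.0], [0.0, 2.0]]]
values = [[[float(3 * i + j)] for j in range(3)] for i in range(2)]
attended = local_memory_attention(keys, [(keys, values)], radius=1)
```

With `radius=0` each position reads only the co-located memory entry, so the output equals the stored values. Larger radii let the read tolerate motion between frames at a cost that grows with the window area rather than with the full image size. In the pipeline the abstract describes, the result of such a read would then be fused with the current-frame encoding by a second attention-based module before decoding, a step this sketch does not cover.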

Code Implementations: 1 repo