CV · Jan 5, 2021

Local Memory Attention for Fast Video Semantic Segmentation

Matthieu Paul, Martin Danelljan, Luc Van Gool, Radu Timofte

arXiv:2101.01715v214.440 citations · h-index: 167 · Has Code

Originality: Highly original

AI Analysis

This work gives researchers and practitioners a fast, general module for adapting existing single-frame semantic segmentation models to video, with measurable accuracy gains on Cityscapes at a small cost in inference time.

This paper introduces a novel neural network module that converts single-frame semantic segmentation models into video pipelines. It improves mIoU on Cityscapes by 1.7% for ERFNet and 2.1% for PSPNet, with a minimal 1.5ms increase in inference time for ERFNet.

We propose a novel neural network module that transforms an existing single-frame semantic segmentation model into a video semantic segmentation pipeline. In contrast to prior works, we strive towards a simple, fast, and general module that can be integrated into virtually any single-frame architecture. Our approach aggregates a rich representation of the semantic information in past frames into a memory module. Information stored in the memory is then accessed through an attention mechanism. In contrast to previous memory-based approaches, we propose a fast local attention layer, providing temporal appearance cues in the local region of prior frames. We further fuse these cues with an encoding of the current frame through a second attention-based module. The segmentation decoder processes the fused representation to predict the final semantic segmentation. We integrate our approach into two popular semantic segmentation networks: ERFNet and PSPNet. We observe an improvement in segmentation performance on Cityscapes by 1.7% and 2.1% in mIoU respectively, while increasing inference time of ERFNet by only 1.5ms.
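The local attention idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `local_attention`, the `radius` parameter, the dot-product similarity, the softmax normalization, and the handling of image borders are all assumptions made for illustration, and the sketch reads from a single past frame only. Each location in the current frame attends to a small window around the same location in memory, rather than to every memory position.

```python
import math


def local_attention(query, keys, values, radius=1):
    """Attend from each query location to a local window in memory.

    Hypothetical sketch, not the paper's layer.
    query:  H x W x C nested lists (encoding of the current frame)
    keys:   H x W x C nested lists (memory keys from a past frame)
    values: H x W x D nested lists (memory values from the same past frame)
    Returns an H x W x D nested list of attended memory features.
    """
    height, width = len(query), len(query[0])
    dim_v = len(values[0][0])
    output = []
    for y in range(height):
        row = []
        for x in range(width):
            q = query[y][x]
            scores, neighbours = [], []
            # Only visit memory positions near (y, x); skip out-of-bounds ones.
            for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                    k = keys[ny][nx]
                    scores.append(sum(a * b for a, b in zip(q, k)))
                    neighbours.append(values[ny][nx])
            # Numerically stable softmax over the local window.
            top = max(scores)
            weights = [math.exp(s - top) for s in scores]
            total = sum(weights)
            row.append([
                sum(w * v[d] for w, v in zip(weights, neighbours)) / total
                for d in range(dim_v)
            ])
        output.append(row)
    return output
```

In this sketch each location compares against at most (2·radius + 1)² memory positions instead of all H·W of them, which is the kind of saving a local layer has over global memory attention. The fusion with the current-frame encoding and the decoder are not shown, since the abstract does not specify them.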
