CV · Feb 10

Spatio-Temporal Attention for Consistent Video Semantic Segmentation in Automated Driving

arXiv:2602.10052v1
h-index: 8
Originality: Incremental advance
AI Analysis

This work addresses the need for more accurate and stable semantic segmentation in automated driving by extending existing transformer architectures with a spatio-temporal attention mechanism. The contribution is an incremental improvement.

The paper tackles video semantic segmentation by exploiting temporal consistency across frames to improve accuracy and stability in dynamic scenes. Compared to single-frame baselines, it reports gains of 9.20 percentage points in temporal consistency and up to 1.76 percentage points in mean intersection over union (mIoU) on the Cityscapes and BDD100k datasets.
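For readers unfamiliar with the accuracy metric, the following is a minimal sketch of how mean intersection over union is commonly computed for a single label map. The function name `mean_iou` and the toy arrays are illustrative assumptions, not taken from the paper. Benchmark evaluations typically accumulate counts over a whole dataset rather than per image, so treat this as a simplification.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Average per-class IoU over classes present in pred or target.

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    ious = []
    for c in range(num_classes):
        p = pred == c
        t = target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both maps: skip it
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

# Toy 2x3 label maps with three classes (made-up values).
pred = np.array([[0, 0, 1],
                 [1, 1, 2]])
target = np.array([[0, 1, 1],
                   [1, 1, 2]])

score = mean_iou(pred, target, num_classes=3)
print(score)  # per-class IoUs are 0.5, 0.75, 1.0, so the mean is 0.75
```

A gain reported in percentage points is an absolute difference between two such scores expressed in percent, not a relative change.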

Deep neural networks, especially transformer-based architectures, have achieved remarkable success in semantic segmentation for environmental perception. However, existing models process video frames independently, thus failing to leverage temporal consistency, which could significantly improve both accuracy and stability in dynamic scenes. In this work, we propose a Spatio-Temporal Attention (STA) mechanism that extends transformer attention blocks to incorporate multi-frame context, enabling robust temporal feature representations for video semantic segmentation. Our approach modifies standard self-attention to process spatio-temporal feature sequences while maintaining computational efficiency and requiring minimal changes to existing architectures. STA demonstrates broad applicability across diverse transformer architectures and remains effective across both lightweight and larger-scale models. A comprehensive evaluation on the Cityscapes and BDD100k datasets shows substantial improvements of 9.20 percentage points in temporal consistency metrics and up to 1.76 percentage points in mean intersection over union compared to single-frame baselines. These results demonstrate STA as an effective architectural enhancement for video-based semantic segmentation applications.
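The abstract describes extending self-attention so that it operates on spatio-temporal feature sequences. One plausible reading, sketched below in NumPy, is that the tokens of the current frame act as queries while the tokens of all frames in a short window act as keys and values. This is an assumption made for illustration: the abstract does not specify the exact formulation, and the function name `spatio_temporal_attention`, the single-head layout, the tensor shapes, and the random weights are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatio_temporal_attention(frames, w_q, w_k, w_v):
    """Single-head attention over a window of frames (illustrative only).

    frames: (T, N, C) array with T frames, N tokens per frame, C channels.
            The last frame is treated as the current frame (an assumption).
    Returns the attended features (N, D) and the attention weights (N, T*N).
    """
    t, n, c = frames.shape
    current = frames[-1]                  # (N, C) queries come from the current frame
    context = frames.reshape(t * n, c)    # (T*N, C) keys/values span all frames
    q = current @ w_q                     # (N, D)
    k = context @ w_k                     # (T*N, D)
    v = context @ w_v                     # (T*N, D)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)    # each query attends over space and time
    return weights @ v, weights

# Toy input: 3 frames, 16 tokens each, 8 channels (made-up sizes).
rng = np.random.default_rng(0)
frames = rng.normal(size=(3, 16, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))

out, weights = spatio_temporal_attention(frames, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

With a window of a single frame (T = 1), this sketch reduces to ordinary single-head self-attention, which is consistent with the abstract's claim that only minimal changes to existing architectures are needed. The attention matrix grows linearly with the number of frames in the window, so the window length is the main cost lever in this reading.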

