CV | Feb 17, 2024

A Decoding Scheme with Successive Aggregation of Multi-Level Features for Light-Weight Semantic Segmentation

arXiv:2402.11201v2 | h-index: 2 | ICIP
AI Analysis

This work addresses efficient semantic segmentation for applications that require high-resolution image processing. It represents an incremental improvement in method design.

The paper tackles the computational complexity of high-resolution semantic segmentation by proposing a novel decoding scheme that uses successive cross-attention to aggregate multi-level features from a multi-scale encoder, achieving improved segmentation accuracy with significantly lower computational cost compared to state-of-the-art models.

Multi-scale architectures, including hierarchical vision transformers, have been commonly applied to high-resolution semantic segmentation to deal with computational complexity with minimal performance loss. In this paper, we propose a novel decoding scheme for semantic segmentation in this regard, which takes multi-level features from an encoder with a multi-scale architecture. The decoding scheme, based on a multi-level vision transformer, aims to achieve not only reduced computational expense but also higher segmentation accuracy by introducing successive cross-attention in the aggregation of the multi-level features. Furthermore, a way to enhance the multi-level features with the aggregated semantics is proposed. The effort is focused on maintaining contextual consistency from the perspective of attention allocation and brings improved performance with significantly lower computational cost. A set of experiments on popular datasets demonstrates the superiority of the proposed scheme over state-of-the-art semantic segmentation models in terms of computational cost without loss of accuracy, and extensive ablation studies confirm the effectiveness of the proposed ideas.
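
To make the idea of successive cross-attention over multi-level features more concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes that every encoder level has already been flattened to a token sequence and projected to a common channel width, that aggregation starts from the coarsest level, and that each step uses single-head attention with a residual connection. The function names (`cross_attention`, `successive_aggregation`), the random weights, and the token counts are hypothetical choices for illustration only. The feature-enhancement step mentioned in the abstract is not covered.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query, context, w_q, w_k, w_v):
    # Single-head cross-attention: query tokens attend to context tokens.
    q = query @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def successive_aggregation(features, weights):
    # features: list of (N_i, C) arrays, coarsest level first (assumed order).
    # The running aggregate queries each finer level in turn, so the
    # attention cost per step scales with N_0 * N_i rather than N_i * N_i.
    agg = features[0]
    for feat, (w_q, w_k, w_v) in zip(features[1:], weights):
        agg = agg + cross_attention(agg, feat, w_q, w_k, w_v)
    return agg


rng = np.random.default_rng(0)
C = 32
# Hypothetical token counts for a 4-level encoder on a 64x64 input
# with strides 32, 16, 8, and 4 (2x2, 4x4, 8x8, 16x16 tokens).
token_counts = [4, 16, 64, 256]
features = [rng.standard_normal((n, C)) for n in token_counts]
weights = [
    tuple(rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))
    for _ in token_counts[1:]
]

aggregated = successive_aggregation(features, weights)
print(aggregated.shape)  # (4, 32)
```

In this sketch the aggregate keeps the token count of the coarsest level throughout, which is one plausible way such a decoder could keep attention cost low. The actual query construction and ordering in the paper may differ.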

