CV · Dec 15, 2022

Full Contextual Attention for Multi-resolution Transformers in Semantic Segmentation

arXiv:2212.07890v1 · 16 citations · h-index: 38
Originality: Incremental advance
AI Analysis

The paper addresses the problem of capturing global interactions in high-resolution feature maps for semantic segmentation, with applications in both natural and medical imaging. The advance is incremental, since it builds on existing transformer backbones.

Multi-resolution transformers capture only local interactions in high-resolution feature maps. The paper tackles this limitation by introducing GLAM, a module that enables full contextual attention across all image regions. GLAM-augmented backbones perform substantially better than their vanilla counterparts on ADE20K and Cityscapes, and GLAM-nnFormer achieves state-of-the-art results on the BCV dataset.

Transformers have proved to be very effective for visual recognition tasks. In particular, vision transformers construct compressed global representations through self-attention and learnable class tokens. Multi-resolution transformers have shown recent successes in semantic segmentation but can only capture local interactions in high-resolution feature maps. This paper extends the notion of global tokens to build GLobal Attention Multi-resolution (GLAM) transformers. GLAM is a generic module that can be integrated into most existing transformer backbones. GLAM includes learnable global tokens, which unlike previous methods can model interactions between all image regions, and extracts powerful representations during training. Extensive experiments show that GLAM-Swin or GLAM-Swin-UNet exhibit substantially better performances than their vanilla counterparts on ADE20K and Cityscapes. Moreover, GLAM can be used to segment large 3D medical images, and GLAM-nnFormer achieves new state-of-the-art performance on the BCV dataset.
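
The idea of learnable global tokens described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes a simplified two-step scheme: each window attends over its own patches together with its own global tokens, and then the global tokens of all windows attend to one another. The names `self_attention` and `glam_like_step`, the identity projections (no learned weights, single head), and all shapes are hypothetical choices made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # Single-head self-attention with identity projections (assumption:
    # a real model would use learned query/key/value matrices).
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)
    return softmax(scores) @ tokens

def glam_like_step(windows, global_tokens):
    # windows:       (n_windows, n_patches, dim)
    # global_tokens: (n_windows, n_global, dim)
    n_global = global_tokens.shape[1]
    # 1) Local step: attention inside each window over its patches
    #    plus that window's global tokens.
    joint = np.concatenate([global_tokens, windows], axis=1)
    joint = self_attention(joint)
    g, w = joint[:, :n_global], joint[:, n_global:]
    # 2) Global step: the global tokens of all windows attend to each other,
    #    which lets information cross window boundaries.
    flat = g.reshape(1, -1, g.shape[-1])
    g = self_attention(flat).reshape(g.shape)
    return w, g

rng = np.random.default_rng(0)
windows = rng.normal(size=(4, 9, 8))   # 4 windows, 9 patches each, dim 8
tokens = rng.normal(size=(4, 2, 8))    # 2 global tokens per window

# Baseline: two consecutive steps.
w1, g1 = glam_like_step(windows, tokens)
w2, g2 = glam_like_step(w1, g1)

# Same input, but with the patches of window 0 perturbed.
perturbed = windows.copy()
perturbed[0] += 1.0
w1p, g1p = glam_like_step(perturbed, tokens)
w2p, g2p = glam_like_step(w1p, g1p)
```

In this sketch, the patches of window 3 are unaffected by the perturbation of window 0 after one step, because the local step only sees tokens of the same window. After the second step they differ, because the global tokens have carried information from window 0 to every other window. Purely window-local attention would never propagate that change.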

