Low-Resolution Self-Attention for Semantic Segmentation
This work targets the high computational cost of transformer-based semantic segmentation, offering an incremental improvement that makes the self-attention in existing vision transformers cheaper to compute.
The paper tackles the computational bottleneck of high-resolution context modeling in vision transformers for semantic segmentation by introducing Low-Resolution Self-Attention (LRSA), which reduces FLOPs while outperforming state-of-the-art models on the ADE20K, COCO-Stuff, and Cityscapes datasets.
Semantic segmentation naturally requires both high-resolution information for pixel-wise prediction and global context for class prediction. While existing vision transformers demonstrate promising performance, they often model context at high resolution, which creates a computational bottleneck. In this work, we challenge conventional wisdom and introduce the Low-Resolution Self-Attention (LRSA) mechanism, which captures global context at a significantly reduced computational cost, measured in FLOPs. Our approach computes self-attention in a fixed low-resolution space regardless of the input image's resolution, and uses additional 3x3 depth-wise convolutions to capture fine details in the high-resolution space. We demonstrate the effectiveness of LRSA by building LRFormer, a vision transformer with an encoder-decoder structure. Extensive experiments on the ADE20K, COCO-Stuff, and Cityscapes datasets show that LRFormer outperforms state-of-the-art models. Code is available at https://github.com/yuhuan-wu/LRFormer.
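The cost argument in the abstract can be illustrated with a rough back-of-the-envelope count. The sketch below is not taken from the LRFormer code: the function names, the 16x16 pooled size, the channel width of 64, and the simplified cost model (multiply-accumulates for the two attention matrix products plus one 3x3 depth-wise convolution, ignoring projections, softmax, and pooling) are all assumptions chosen for illustration.

```python
def attention_macs(num_queries: int, num_keys: int, dim: int) -> int:
    """Multiply-accumulates for Q @ K^T plus attn @ V.

    Simplified model: linear projections and softmax are ignored.
    """
    return 2 * num_queries * num_keys * dim


def full_res_macs(height: int, width: int, dim: int) -> int:
    """Self-attention over every position of a height x width feature map."""
    tokens = height * width
    return attention_macs(tokens, tokens, dim)


def low_res_macs(height: int, width: int, dim: int, pool: int = 16) -> int:
    """Attention in a fixed pool x pool space plus a 3x3 depth-wise conv.

    pool=16 is an assumed value for this sketch, not a confirmed setting.
    """
    tokens = pool * pool  # fixed, independent of height and width
    attn = attention_macs(tokens, tokens, dim)
    # A 3x3 depth-wise convolution costs 9 MACs per position per channel.
    dwconv = 9 * height * width * dim
    return attn + dwconv


if __name__ == "__main__":
    dim = 64  # assumed channel width
    for side in (32, 64, 128):
        full = full_res_macs(side, side, dim)
        low = low_res_macs(side, side, dim)
        print(f"{side}x{side}: full={full:,} low={low:,} ratio={full / low:.1f}")
```

Under this model the full-resolution attention term grows quadratically with the number of feature-map positions, while the low-resolution attention term stays constant and only the depth-wise convolution grows, linearly, with resolution.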