CV · Sep 11, 2024

Brain-Inspired Stepwise Patch Merging for Vision Transformers

arXiv:2409.06963v2 · h-index: 19 · Has Code
Originality: Incremental advance
AI Analysis

This work targets better visual understanding in Vision Transformers and is aimed at computer vision researchers and practitioners. It appears incremental, since it builds on existing hierarchical ViT designs.

The paper enhances the hierarchical architecture of Vision Transformers by proposing Stepwise Patch Merging (SPM), which improves performance on the ImageNet-1K, COCO, and ADE20K benchmarks, particularly in dense prediction tasks such as object detection and semantic segmentation.

The hierarchical architecture has become a mainstream design paradigm for Vision Transformers (ViTs), with Patch Merging serving as the pivotal component that transforms a columnar architecture into a hierarchical one. Drawing inspiration from the brain's ability to integrate global and local information for comprehensive visual understanding, we propose Stepwise Patch Merging (SPM), which enhances the subsequent attention mechanism's ability to 'see' better. SPM consists of Multi-Scale Aggregation (MSA) and Guided Local Enhancement (GLE), striking a proper balance between long-range dependency modeling and local feature enhancement. Extensive experiments conducted on benchmark datasets, including ImageNet-1K, COCO, and ADE20K, demonstrate that SPM significantly improves the performance of various models, particularly in dense prediction tasks such as object detection and semantic segmentation. Meanwhile, experiments show that combining SPM with different backbones can further improve performance. The code has been released at https://github.com/Yonghao-Yu/StepwisePatchMerging.
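
For context on what SPM modifies, the conventional Patch Merging step mentioned in the abstract can be sketched in a few lines. This is a minimal NumPy illustration of the baseline operation as commonly used in hierarchical ViTs such as Swin Transformer: each 2x2 neighbourhood of tokens is concatenated along the channel axis and linearly projected. It is not the SPM method itself; the MSA and GLE modules are not reproduced here. The function name `patch_merge` and the random `weight` matrix are assumptions made for illustration (in a real model the projection is learned, and a normalization layer typically precedes it).

```python
import numpy as np

def patch_merge(x, weight):
    """Baseline 2x2 patch merging (illustrative sketch, not SPM).

    x:      (H, W, C) token grid, with H and W even
    weight: (4*C, C_out) projection matrix (hypothetical; learned in practice)
    """
    H, W, C = x.shape
    assert H % 2 == 0 and W % 2 == 0, "H and W must be even"
    # Split the grid into 2x2 blocks: (H/2, 2, W/2, 2, C)
    blocks = x.reshape(H // 2, 2, W // 2, 2, C)
    # Move the two in-block axes next to the channel axis: (H/2, W/2, 2, 2, C)
    blocks = blocks.transpose(0, 2, 1, 3, 4)
    # Concatenate the four tokens of each block along channels: (H/2, W/2, 4C)
    merged = blocks.reshape(H // 2, W // 2, 4 * C)
    # Linear projection to the next stage's channel width
    return merged @ weight

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))   # 8x8 tokens, 16 channels
w = rng.standard_normal((64, 32))     # 4*16 -> 32 channels
y = patch_merge(x, w)
print(y.shape)  # (4, 4, 32)
```

Spatial resolution halves in each dimension while the channel width grows, which is what turns a columnar ViT into a hierarchical one. According to the abstract, SPM replaces this single-step merge with multi-scale aggregation followed by guided local enhancement.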
