CV · May 19, 2025

Pyramid Sparse Transformer: Enhancing Multi-Scale Feature Fusion with Dynamic Token Selection

arXiv:2505.12772v2 · 2 citations · h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses efficiency challenges in resource-constrained environments for vision tasks like detection and classification, offering a plug-and-play enhancement, though it is incremental as it builds on existing models.

The paper tackles the problem of high computational complexity in attention-based feature fusion for vision models by introducing the Pyramid Sparse Transformer (PST), which reduces computation while preserving spatial detail, resulting in mAP improvements of up to 0.9% on MS COCO and top-1 accuracy boosts of up to 6.5% on ImageNet.

Feature fusion is critical for high-performance vision models, but prevailing attention-based fusion methods often incur significant computational complexity and implementation challenges, limiting their efficiency in resource-constrained environments. To address these issues, we introduce the Pyramid Sparse Transformer (PST), a lightweight, plug-and-play module that integrates coarse-to-fine token selection and shared attention parameters to reduce computation while preserving spatial detail. PST can be trained using only coarse attention, with fine attention then enabled at inference for further accuracy gains without retraining. When added to state-of-the-art real-time detection models such as YOLOv11-N/S/M, PST yields mAP improvements of 0.9%, 0.5%, and 0.4%, respectively, on MS COCO with minimal latency impact. Likewise, embedding PST into ResNet-18/50/101 backbones boosts ImageNet top-1 accuracy by 6.5%, 1.7%, and 1.0%, respectively. These results demonstrate PST's effectiveness as a simple, hardware-friendly enhancement for both detection and classification tasks.
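
The coarse-to-fine token selection described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes a coarse grid whose tokens each cover a 2x2 patch of a fine grid, ranks coarse tokens by their mean attention weight, and adds the fine-stage output to the coarse-stage output. All function names (`children_of`, `coarse_to_fine_attention`), the `top_k` and `use_fine` parameters, and the omission of learned projections are illustrative assumptions. Because no projections are used, both stages trivially share the same attention function, loosely mirroring the shared-parameter idea.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Scaled dot-product attention weights, one row per query."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        for q in queries
    ]

def weighted_sum(weights, values):
    """Apply attention weights to value vectors."""
    dim = len(values[0])
    return [
        [sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
        for row in weights
    ]

def children_of(coarse_idx, coarse_w):
    """Fine-grid indices of the 2x2 patch under one coarse token
    (assumption: row-major grids, fine grid is twice as wide)."""
    r, c = divmod(coarse_idx, coarse_w)
    fine_w = 2 * coarse_w
    return [(2 * r + dr) * fine_w + (2 * c + dc) for dr in (0, 1) for dc in (0, 1)]

def coarse_to_fine_attention(queries, coarse_kv, fine_kv, coarse_w, top_k, use_fine=True):
    """Hypothetical sketch: coarse attention, then attention restricted to
    the fine tokens under the top_k most-attended coarse tokens."""
    # Coarse stage: every query attends to every coarse token.
    w = attention_weights(queries, coarse_kv)
    out = weighted_sum(w, coarse_kv)
    if not use_fine:  # e.g. a training-time setting with coarse attention only
        return out, []

    # Rank coarse tokens by mean attention over all queries (assumed criterion).
    n = len(coarse_kv)
    mean_score = [sum(row[j] for row in w) / len(w) for j in range(n)]
    selected = sorted(range(n), key=lambda j: mean_score[j], reverse=True)[:top_k]

    # Fine stage: attend only to the fine tokens under the selected coarse tokens.
    fine_idx = [i for j in sorted(selected) for i in children_of(j, coarse_w)]
    fine_tokens = [fine_kv[i] for i in fine_idx]
    fine_out = weighted_sum(attention_weights(queries, fine_tokens), fine_tokens)

    # Assumption: the two outputs are combined by simple addition.
    fused = [[a + b for a, b in zip(o, f)] for o, f in zip(out, fine_out)]
    return fused, fine_idx
```

In this sketch the fine stage attends to `4 * top_k` tokens instead of the full fine grid, which is where the computational saving would come from. Setting `use_fine=False` returns the coarse result alone, so the fine stage can be switched on later without changing any other part of the function.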

