CV · LG · Mar 7

StructSAM: Structure- and Spectrum-Preserving Token Merging for Segment Anything Models

arXiv:2603.07307v1 · 1 citation
Predicted impact: top 7% in CV · last 90 days
Originality: Incremental advance
AI Analysis

This work gives practitioners and researchers a training-free way to accelerate SAM models by cutting encoder compute with only minor accuracy loss. It is an incremental improvement for the computer vision community.

The paper addresses the challenge of applying token merging to Segment Anything Models (SAM) without retraining, which is difficult because SAM mixes windowed and global attention and relies on dense features. The authors propose StructSAM, a resolution-preserving merge-unmerge framework that reduces encoder FLOPs by 25-30% (up to 40%+ with prompt-aware merging) with minor drops in mIoU/Dice across eight benchmarks.

Recent token merging techniques for Vision Transformers (ViTs) provide substantial speedups by reducing the number of tokens processed by self-attention, often without retraining. However, their direct application to the Segment Anything Model (SAM) family is nontrivial: SAM's image encoder mixes windowed and global attention, and its mask decoder relies on dense, prompt-conditioned features for precise boundary prediction. We systematically evaluate representative token-merging methods on SAM and Medical SAM in a strict off-the-shelf setting, and find that existing destination-selection heuristics can erode boundaries and leak prompt information as merge rates increase. We propose StructSAM, a resolution-preserving merge-unmerge framework tailored to SAM. StructSAM computes a lightweight token-energy score from first-order feature gradients, uses grid-based flatness screening to protect boundary and prompt regions, and merges tokens within flat areas toward low-energy destinations with explicit token recovery. We further provide a spectral graph coarsening view showing that score-guided merging yields bounded Laplacian spectral distortion compared to random or window-restricted baselines. Across eight natural and medical benchmarks, StructSAM reduces encoder FLOPs by 25-30% (up to 40%+ with prompt-aware merging) with minor drops in mIoU/Dice, consistently outperforming ToMe, PiToMe, ToMeSD, VidToMe, and ALGM at the same compute.
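
The merge-unmerge idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not give the exact energy formula, screening rule, or merge operator, so everything here is an assumption chosen for illustration. The energy is taken as the magnitude of central-difference feature gradients, a grid cell counts as "flat" when its maximum energy is below a threshold, flat cells are averaged into their lowest-energy token, and recovery copies the merged token back to every original position. The names `token_energy`, `merge_flat_cells`, `unmerge`, `cell`, and `flat_thresh` are hypothetical, and prompt-aware protection is omitted.

```python
import numpy as np

def token_energy(feats):
    """Per-token energy for a (H, W, C) feature map.

    Assumption: energy = L2 norm of first-order (central-difference)
    spatial gradients, summed over channels.
    """
    gy, gx = np.gradient(feats, axis=(0, 1))
    return np.sqrt((gx ** 2 + gy ** 2).sum(axis=-1))  # shape (H, W)

def merge_flat_cells(feats, cell=2, flat_thresh=0.1):
    """Merge tokens inside flat grid cells; leave other cells untouched.

    Returns the reduced token set and an index map for later recovery.
    """
    H, W, C = feats.shape
    assert H % cell == 0 and W % cell == 0, "grid must tile the feature map"
    energy = token_energy(feats)
    tokens = feats.reshape(H * W, C)
    assign = np.arange(H * W)  # token -> destination token (itself by default)

    for i in range(0, H, cell):
        for j in range(0, W, cell):
            e = energy[i:i + cell, j:j + cell]
            if e.max() >= flat_thresh:
                continue  # cell contains structure (e.g. a boundary): protect it
            rows = np.arange(i, i + cell)[:, None]
            cols = np.arange(j, j + cell)[None, :]
            idx = (rows * W + cols).ravel()
            assign[idx] = idx[np.argmin(e.ravel())]  # lowest-energy destination

    kept, inverse = np.unique(assign, return_inverse=True)
    inverse = inverse.ravel()
    merged = np.zeros((len(kept), C))
    np.add.at(merged, inverse, tokens)           # sum members of each group
    merged /= np.bincount(inverse)[:, None]      # average them
    return merged, inverse

def unmerge(merged, inverse, H, W):
    """Restore the dense (H, W, C) layout by copying each merged token back."""
    return merged[inverse].reshape(H, W, -1)

# Toy feature map: a vertical step edge in the middle of an 8x8 grid.
feats = np.zeros((8, 8, 1))
feats[:, 4:] = 1.0
merged, inverse = merge_flat_cells(feats, cell=2, flat_thresh=0.1)
restored = unmerge(merged, inverse, 8, 8)
print(feats.shape[0] * feats.shape[1], "->", merged.shape[0], "tokens")
```

In this toy case the cells touching the step edge are left at full resolution while the constant regions on either side collapse to one token per cell, and unmerging restores the original dense grid. In a real encoder the attention blocks would run on the reduced token set between the merge and unmerge steps.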

