CV · Dec 10, 2023

RepViT-SAM: Towards Real-Time Segmenting Anything

arXiv:2312.05760v2 · 37 citations · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the problem of real-time segmentation on resource-constrained mobile devices, offering an incremental improvement over prior methods like MobileSAM.

The paper tackles the high computational cost of deploying the Segment Anything Model (SAM) on mobile devices by replacing its image encoder with RepViT. The resulting RepViT-SAM achieves significantly better zero-shot transfer than MobileSAM and nearly 10× faster inference.

Segment Anything Model (SAM) has recently shown impressive zero-shot transfer performance on various computer vision tasks. However, its heavy computational cost remains daunting for practical applications. MobileSAM proposes to replace the heavyweight image encoder in SAM with TinyViT by employing distillation, which significantly reduces computational requirements. However, its deployment on resource-constrained mobile devices still faces challenges due to the substantial memory and computational overhead of self-attention. Recently, RepViT has achieved the state-of-the-art trade-off between performance and latency on mobile devices by incorporating efficient architectural designs of ViTs into CNNs. Here, to achieve real-time segmenting anything on mobile devices, we follow MobileSAM and replace the heavyweight image encoder in SAM with the RepViT model, ending up with the RepViT-SAM model. Extensive experiments show that RepViT-SAM enjoys significantly better zero-shot transfer capability than MobileSAM, along with nearly 10× faster inference speed. The code and models are available at https://github.com/THU-MIG/RepViT.
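The core recipe the abstract inherits from MobileSAM is to train a lightweight image encoder to imitate SAM's heavyweight one, leaving the rest of SAM untouched. What follows is a toy, pure-Python sketch of that encoder-level distillation idea under loudly stated assumptions: tiny linear maps stand in for the frozen ViT teacher and the RepViT student, random vectors stand in for images, and the mean-squared-error loss and plain SGD loop are assumptions. The names `encode`, `mse`, `distill_step` and `distill` are hypothetical; this is not the authors' training code or the RepViT repository's API.

```python
import random

# Hypothetical toy stand-ins: an "encoder" is just a weight matrix mapping an
# input feature vector to an embedding. In RepViT-SAM the teacher would be SAM's
# ViT image encoder and the student a RepViT backbone (assumption for illustration).


def encode(weights, x):
    """Apply a linear 'image encoder' to the input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def mse(a, b):
    """Mean squared error between two embeddings (assumed distillation loss)."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


def distill_step(student_w, teacher_w, x, lr):
    """One SGD step pulling the student's embedding toward the frozen teacher's."""
    target = encode(teacher_w, x)  # teacher is frozen: its weights are never updated
    pred = encode(student_w, x)
    n = len(pred)
    for i, row in enumerate(student_w):
        # Gradient of the MSE w.r.t. row i is (2 / n) * (pred_i - target_i) * x
        err = 2.0 / n * (pred[i] - target[i])
        for j in range(len(row)):
            row[j] -= lr * err * x[j]
    return mse(pred, target)


def distill(teacher_w, steps=2000, lr=0.05, seed=0):
    """Train a zero-initialised student encoder to mimic the teacher on random inputs."""
    rng = random.Random(seed)
    dim_out, dim_in = len(teacher_w), len(teacher_w[0])
    student_w = [[0.0] * dim_in for _ in range(dim_out)]
    for _ in range(steps):
        x = [rng.uniform(-1.0, 1.0) for _ in range(dim_in)]  # stand-in for an image
        distill_step(student_w, teacher_w, x, lr)
    return student_w


if __name__ == "__main__":
    teacher = [[0.5, -1.0, 0.2, 0.0],
               [1.0, 0.3, -0.4, 0.7],
               [-0.2, 0.8, 0.1, -0.5]]
    student = distill(teacher)
    probe = [0.3, -0.6, 0.9, 0.1]
    print("embedding MSE on probe input:",
          mse(encode(student, probe), encode(teacher, probe)))
```

Because only the encoder is distilled in this setting, the student's embeddings are trained to be interchangeable with the teacher's, which is what would let SAM's prompt encoder and mask decoder be reused on top of the lighter encoder.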

Code implementations: 2 repos