CV | Jan 4, 2024

BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model

Yiran Song, Qianyu Zhou, Xiangtai Li, Deng-Ping Fan, Xuequan Lu, Lizhuang Ma

arXiv:2401.02317v414.123 citations | h-index: 10 | Has Code | CVPR

Originality: Incremental advance

AI Analysis

This work addresses a practical problem for SAM users in computer vision applications: adapting to varying image resolutions without costly retraining. The advance is incremental, since it builds on SAM's existing framework.

The paper tackles the performance degradation of the Segment Anything Model (SAM) on images of varying resolutions by proposing BA-SAM, a method that reformulates the issue as a length extrapolation problem and introduces a scalable bias-mode attention mask. The results show that BA-SAM significantly mitigates performance degradation in zero-shot settings and achieves state-of-the-art performance with minimal fine-tuning across diverse datasets.
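To make the length extrapolation framing concrete, the short sketch below counts how many patch tokens an image produces when the patch size stays fixed. The helper name `num_tokens` is hypothetical, and the default patch size of 16 is an assumption based on the ViT backbone commonly used with SAM; neither is taken from this page.

```python
def num_tokens(height: int, width: int, patch_size: int = 16) -> int:
    """Count non-overlapping square patches (tokens) for an image.

    patch_size=16 is an assumed default, not a value stated in the summary.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of the patch size")
    return (height // patch_size) * (width // patch_size)


if __name__ == "__main__":
    # Doubling the image side quadruples the token sequence length.
    for side in (512, 1024, 2048):
        print(side, num_tokens(side, side))
```

Under these assumptions, a model whose attention layers were trained on one sequence length sees much longer or shorter sequences at test time, which is the situation the paper describes as length extrapolation.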

We address the challenge of image resolution variation for the Segment Anything Model (SAM). SAM, known for its zero-shot generalizability, exhibits performance degradation on datasets with varying image sizes. Previous approaches tend to resize the image to a fixed size or adopt structure modifications, which hinders the preservation of SAM's rich prior knowledge. Moreover, such task-specific tuning requires complete retraining of the model, which is costly and impractical for deployment in downstream tasks. We reformulate this issue as a length extrapolation problem, where the token sequence length varies while the patch size stays consistent across images of different sizes. To this end, we propose the Scalable Bias-Mode Attention Mask (BA-SAM) to enhance SAM's adaptability to varying image resolutions without structure modifications. First, we introduce a new scaling factor to ensure a consistent magnitude in the attention layer's dot-product values when the token sequence length changes. Second, we present a bias-mode attention mask that allows each token to prioritize neighboring information, mitigating the impact of untrained distant information. BA-SAM demonstrates efficacy in two scenarios: zero-shot and fine-tuning. Extensive evaluation on diverse datasets, including DIS5K, DUTS, ISIC, COD10K, and COCO, shows that it significantly mitigates performance degradation in the zero-shot setting and achieves state-of-the-art performance with minimal fine-tuning. Furthermore, we propose a generalized model and benchmark, showcasing BA-SAM's generalizability across all four datasets simultaneously. Code is available at https://github.com/zongzi13545329/BA-SAM
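The two ideas in the abstract, a length-aware scaling factor and a bias that favors neighboring tokens, can be illustrated with a minimal NumPy sketch. The abstract does not give the formulas, so everything specific here is an assumption chosen for illustration: the `log(n) / log(train_len)` scale, the linear penalty on Euclidean grid distance, the `slope` value, and the function names. It should not be read as the authors' implementation; the linked repository is the reference for that.

```python
import numpy as np


def grid_distances(h: int, w: int) -> np.ndarray:
    """Pairwise Euclidean distances between patch positions on an h x w grid."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=-1).astype(np.float64)
    diff = coords[:, None, :] - coords[None, :, :]      # (n, n, 2)
    return np.sqrt((diff ** 2).sum(axis=-1))            # (n, n)


def biased_attention(q, k, v, h, w, train_len, slope=0.5):
    """Single-head attention with an assumed length-aware scale and distance bias.

    q, k, v: arrays of shape (h * w, d). train_len is the token count seen in
    training (must be > 1). Both the scale and the bias are illustrative choices.
    """
    n, d = q.shape
    if n != h * w:
        raise ValueError("q must have h * w rows")
    # Assumed scale: reduces to the usual 1/sqrt(d) when n == train_len.
    scale = np.log(n) / np.log(train_len) / np.sqrt(d)
    logits = (q @ k.T) * scale
    # Assumed bias: subtract a penalty that grows with spatial distance,
    # so each token weights nearby patches more than distant ones.
    logits = logits - slope * grid_distances(h, w)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h, w, d = 8, 8, 16
    q, k, v = (rng.normal(size=(h * w, d)) for _ in range(3))
    out, weights = biased_attention(q, k, v, h, w, train_len=16)
    print(out.shape, weights.sum(axis=-1)[:3])
```

Because the bias is added to the logits rather than changing any layer shapes, a scheme of this kind leaves the model architecture untouched, which matches the abstract's claim of avoiding structure modifications.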

