Global-Local Feature Decoding with Adapter-Guided SAMv2 for Salient Object Detection
This work addresses the computational cost and overfitting issues of fine-tuning large vision models for salient object detection (SOD), offering an efficient and effective solution for researchers and practitioners working on SOD and related segmentation tasks.
GLASSNet introduces a global-local feature decoding framework that pairs a frozen SAMv2 encoder with a lightweight adapter, reducing learnable encoder parameters by over 97%, and achieves state-of-the-art performance on salient object detection and camouflaged object detection benchmarks.
Salient Object Detection (SOD) remains an essential yet underexplored task in the era of large-scale vision models. Although foundation models like SAM exhibit strong generalization, their potential for SOD is not fully realized, and training or fully fine-tuning them is computationally expensive and prone to overfitting under limited data. To overcome these challenges, we introduce GLASSNet, a Global-Local feature decoding framework that uses SAMv2 as a frozen encoder paired with a lightweight, spatially aware convolutional adapter, reducing learnable encoder parameters by over 97%. To enhance saliency quality, GLASSNet employs a dual-decoder architecture: one decoder captures global, long-range semantics with an expanded receptive field, while the other captures fine local details such as edges and textures. Fusing these complementary cues yields saliency maps that combine global coherence with local precision, producing accurate final masks. Extensive experiments on standard SOD and camouflaged object detection benchmarks show that GLASSNet surpasses state-of-the-art methods, demonstrating the power of frozen foundation models combined with targeted adaptation and global-local decoding.
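The parameter-reduction claim can be read as simple arithmetic. The sketch below assumes the baseline is full fine-tuning of the encoder and that only the adapter is trained when the encoder is frozen. The abstract does not define the baseline, so this reading is an assumption. The function name `learnable_reduction` and the parameter counts are hypothetical and chosen only for illustration; they are not figures reported for SAMv2 or GLASSNet.

```python
def learnable_reduction(encoder_params: int, adapter_params: int) -> float:
    """Fractional drop in learnable encoder parameters.

    Assumed baseline: full fine-tuning trains all `encoder_params`.
    Assumed alternative: the encoder is frozen and only `adapter_params`
    are trained.
    """
    if encoder_params <= 0:
        raise ValueError("encoder_params must be positive")
    if adapter_params < 0:
        raise ValueError("adapter_params must be non-negative")
    return 1.0 - adapter_params / encoder_params


# Hypothetical counts for illustration only (not taken from the paper).
HYPOTHETICAL_ENCODER_PARAMS = 200_000_000
HYPOTHETICAL_ADAPTER_PARAMS = 5_000_000

reduction = learnable_reduction(
    HYPOTHETICAL_ENCODER_PARAMS, HYPOTHETICAL_ADAPTER_PARAMS
)
print(f"learnable encoder parameters reduced by {reduction:.1%}")
```

Under this reading, a reduction of over 97% means the adapter holds less than 3% as many parameters as the encoder it adapts.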