Probabilistic Prototype Calibration of Vision-Language Models for Generalized Few-shot Semantic Segmentation
This work improves segmentation models for computer vision applications by enabling better adaptation to novel classes with limited data while maintaining base class performance, representing an incremental advancement in few-shot learning.
The paper addresses generalized few-shot semantic segmentation, where existing prototype-based methods are limited by their deterministic formulation. It proposes FewCLIP, which applies probabilistic prototype calibration to multi-modal prototypes from CLIP and reports significant improvements over state-of-the-art approaches on the PASCAL-5$^i$ and COCO-20$^i$ datasets.
Generalized Few-Shot Semantic Segmentation (GFSS) aims to extend a segmentation model to novel classes with only a few annotated examples while maintaining performance on base classes. Recently, pretrained vision-language models (VLMs) such as CLIP have been leveraged in GFSS to improve generalization on novel classes through multi-modal prototype learning. However, existing prototype-based methods are inherently deterministic, which limits the adaptability of the learned prototypes to diverse samples, particularly for novel classes with scarce annotations. To address this, we propose FewCLIP, a probabilistic prototype calibration framework over multi-modal prototypes from the pretrained CLIP, which provides more adaptive prototype learning for GFSS. Specifically, FewCLIP first introduces a prototype calibration mechanism that refines frozen textual prototypes with learnable visual calibration prototypes, leading to a more discriminative and adaptive representation. Furthermore, unlike deterministic prototype learning techniques, FewCLIP introduces distribution regularization over these calibration prototypes. This probabilistic formulation ensures structured and uncertainty-aware prototype learning, mitigating overfitting to limited novel-class data while enhancing generalization. Extensive experimental results on the PASCAL-5$^i$ and COCO-20$^i$ datasets demonstrate that FewCLIP significantly outperforms state-of-the-art approaches in both the GFSS and class-incremental settings. The code is available at https://github.com/jliu4ai/FewCLIP.
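The calibration idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and every specific below is an assumption made for illustration: the calibration prototypes are modelled as a diagonal Gaussian per class, calibration is additive, the distribution regularizer is a KL divergence to a standard normal prior, and pixels are classified by cosine similarity. All function and variable names (`calibrate_prototypes`, `calib_mu`, `calib_logvar`, and so on) are hypothetical, and random vectors stand in for CLIP text embeddings and pixel features.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def calibrate_prototypes(text_protos, calib_mu, calib_logvar, rng):
    """Add a sampled calibration offset to frozen textual prototypes.

    Assumption: additive calibration with a reparameterized Gaussian sample.
    text_protos, calib_mu, calib_logvar all have shape (num_classes, dim).
    """
    std = np.exp(0.5 * calib_logvar)
    noise = rng.standard_normal(calib_mu.shape)
    calib = calib_mu + std * noise
    return l2_normalize(text_protos + calib)


def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)), averaged over classes.

    Assumption: this stands in for the paper's distribution regularization.
    """
    kl = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=-1)
    return float(kl.mean())


def segment(pixel_feats, prototypes):
    """Assign each pixel to the prototype with the highest cosine similarity.

    pixel_feats: (H, W, dim); prototypes: (num_classes, dim), unit length.
    """
    logits = l2_normalize(pixel_feats) @ prototypes.T
    return logits.argmax(axis=-1)


rng = np.random.default_rng(0)
num_classes, dim = 4, 16

# Stand-ins for frozen CLIP text embeddings (hypothetical random data).
text_protos = l2_normalize(rng.standard_normal((num_classes, dim)))

# Parameters that would be learned; fixed here for the demonstration.
calib_mu = np.zeros((num_classes, dim))
calib_logvar = np.full((num_classes, dim), -4.0)

protos = calibrate_prototypes(text_protos, calib_mu, calib_logvar, rng)
pixel_feats = rng.standard_normal((8, 8, dim))
pred = segment(pixel_feats, protos)
kl = kl_to_standard_normal(calib_mu, calib_logvar)
```

In a training loop, a term like `kl` would be added to the segmentation loss with some weight, so that the calibration distribution stays close to the prior when only a few novel-class samples are available. The weighting and the choice of prior are design decisions that this sketch does not take from the paper.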