CV | Aug 27, 2024

Applying ViT in Generalized Few-shot Semantic Segmentation

arXiv:2408.14957v12.0 | h-index: 4 | Has Code

Originality: Incremental advance

AI Analysis

The paper addresses few-shot segmentation, a problem of interest to computer vision researchers, but the contribution is incremental because it builds on existing ViT and GFSS frameworks.

This paper tackles generalized few-shot semantic segmentation (GFSS) with Vision Transformer (ViT)-based models. Its best configuration, DINOv2 with a linear classifier, outperforms the best ResNet structure by 116% in the one-shot scenario on the PASCAL-$5^i$ benchmark.
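To make the "116%" figure concrete, here is a minimal sketch of how such a number is usually computed, assuming it is a relative gain in the evaluation score (the summary does not name the metric). The function name and both scores are placeholders chosen for illustration and are not results from the paper.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Return the relative gain of `new` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline * 100.0


# Placeholder scores, NOT taken from the paper.
baseline_score = 10.0   # hypothetical score of the best ResNet structure
vit_score = 21.6        # hypothetical score of the ViT-based structure

gain = relative_improvement(baseline_score, vit_score)
print(f"{gain:.0f}% relative improvement")
```

Under this reading, a 116% improvement means the new score is about 2.16 times the baseline, not 116 percentage points higher.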

This paper explores the capability of ViT-based models under the generalized few-shot semantic segmentation (GFSS) framework. We conduct experiments with various combinations of backbone models, including ResNets and pretrained Vision Transformer (ViT)-based models, along with decoders featuring a linear classifier, UPerNet, and Mask Transformer. The combination of DINOv2 and a linear classifier takes the lead on the popular few-shot segmentation benchmark PASCAL-$5^i$, outperforming the best ResNet-based structure by 116% in the one-shot scenario. We demonstrate the great potential of large pretrained ViT-based models on the GFSS task and expect further improvement on testing benchmarks. However, a potential caveat is that a pure ViT-based model paired with a large-scale ViT decoder overfits easily.
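The idea of a linear classifier used as a segmentation decoder can be sketched as follows. This is a toy illustration in plain Python, not the authors' implementation: the patch tokens are hand-written stand-ins for features that a ViT backbone such as DINOv2 would produce, the function names are hypothetical, and nearest-neighbour upsampling is an assumption (the abstract does not say how patch predictions are mapped back to pixels).

```python
def linear_patch_classifier(tokens, weights, bias):
    """Score each patch token against every class: logits = W x + b."""
    return [
        [sum(w * x for w, x in zip(row, tok)) + b for row, b in zip(weights, bias)]
        for tok in tokens
    ]


def segment(tokens, grid_h, grid_w, patch, weights, bias):
    """Turn per-patch logits into a pixel-level label map.

    Each patch takes its argmax class, and that label is copied to every
    pixel of the patch (nearest-neighbour upsampling, an assumption here).
    """
    logits = linear_patch_classifier(tokens, weights, bias)
    labels = [max(range(len(l)), key=l.__getitem__) for l in logits]
    mask = []
    for gy in range(grid_h):
        row = []
        for gx in range(grid_w):
            row.extend([labels[gy * grid_w + gx]] * patch)
        # Repeat the expanded row once per pixel row of the patch.
        mask.extend([list(row) for _ in range(patch)])
    return mask


# Toy input: a 2x2 grid of patches, 2-dimensional tokens, 2 classes.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]  # one weight row per class
bias = [0.0, 0.0]

mask = segment(tokens, grid_h=2, grid_w=2, patch=2, weights=weights, bias=bias)
for row in mask:
    print(row)
```

The decoder has only one weight row and one bias per class, so nearly all of the representational work is left to the pretrained backbone. That fits the abstract's observation that larger ViT decoders overfit more easily in the few-shot setting.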
