CV · Aug 27, 2024

Applying ViT in Generalized Few-shot Semantic Segmentation

arXiv:2408.14957v1 · h-index: 4
Originality: Incremental advance
AI Analysis

The paper addresses few-shot segmentation, a problem of interest to computer vision researchers, but the advance is incremental because it builds on existing ViT and GFSS frameworks.

This paper tackles generalized few-shot semantic segmentation with Vision Transformer (ViT)-based models, reporting that the best ViT-based structure outperforms the best ResNet structure by 116% in the one-shot scenario on the PASCAL-$5^i$ benchmark.
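One way to read the 116% figure is as a relative gain over the ResNet baseline rather than a difference in percentage points; the summary does not say which, so this reading is an assumption. Under it, the ViT score is 2.16 times the baseline score. The sketch below shows the arithmetic. The function name and the scores are hypothetical placeholders, not results from the paper.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Return the relative gain of `new` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline * 100.0


# Hypothetical scores chosen only to illustrate the arithmetic.
baseline_score = 10.0   # placeholder for the best ResNet structure
vit_score = 21.6        # placeholder for the best ViT structure
gain = relative_improvement(baseline_score, vit_score)
print(f"relative improvement: {gain:.1f}%")
```

Read this way, a 116% gain corresponds to slightly more than doubling the baseline score, whatever metric is used.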

This paper explores the capability of ViT-based models under the generalized few-shot semantic segmentation (GFSS) framework. We conduct experiments with various combinations of backbone models, including ResNets and pretrained Vision Transformer (ViT)-based models, along with decoders featuring a linear classifier, UPerNet, and Mask Transformer. The structure made of DINOv2 and a linear classifier takes the lead on the popular few-shot segmentation benchmark PASCAL-$5^i$, substantially outperforming the best ResNet structure by 116% in the one-shot scenario. We demonstrate the great potential of large pretrained ViT-based models on the GFSS task, and expect further improvement on testing benchmarks. However, a potential caveat is that a pure ViT-based model combined with a large-scale ViT decoder overfits easily.
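To make the "backbone plus linear classifier" structure concrete, here is a minimal sketch of how a linear decoder can turn ViT patch tokens into a segmentation mask. It is not the authors' implementation: the function names, the toy shapes, and the nearest-neighbour upsampling are assumptions made for illustration (real pipelines commonly upsample the logits bilinearly), and the patch tokens are hand-written stand-ins for backbone features such as those from DINOv2.

```python
from typing import List

Vector = List[float]


def linear_head(tokens: List[Vector], weights: List[Vector], bias: Vector) -> List[Vector]:
    """Apply one shared linear layer to every patch token.

    tokens:  one feature vector per patch (row-major over the patch grid)
    weights: one weight vector per class
    bias:    one bias value per class
    Returns per-patch class logits.
    """
    return [
        [sum(w * x for w, x in zip(class_w, tok)) + b
         for class_w, b in zip(weights, bias)]
        for tok in tokens
    ]


def segment(tokens: List[Vector], weights: List[Vector], bias: Vector,
            grid_h: int, grid_w: int, patch: int) -> List[List[int]]:
    """Classify each patch, then expand each patch label to patch x patch pixels."""
    if len(tokens) != grid_h * grid_w:
        raise ValueError("token count must equal grid_h * grid_w")
    logits = linear_head(tokens, weights, bias)
    # Argmax over classes for each patch (ties resolve to the lowest class id).
    labels = [max(range(len(l)), key=l.__getitem__) for l in logits]
    mask: List[List[int]] = []
    for gy in range(grid_h):
        pixel_row: List[int] = []
        for gx in range(grid_w):
            pixel_row.extend([labels[gy * grid_w + gx]] * patch)
        # Repeat the row `patch` times (nearest-neighbour upsampling).
        mask.extend(list(pixel_row) for _ in range(patch))
    return mask


# Toy example: a 2x2 patch grid, 2-dimensional features, 2 classes.
toy_tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
toy_weights = [[1.0, 0.0], [0.0, 1.0]]
toy_bias = [0.0, 0.0]
toy_mask = segment(toy_tokens, toy_weights, toy_bias, grid_h=2, grid_w=2, patch=2)
for row in toy_mask:
    print(row)
```

The decoder here has only one weight vector and one bias per class, which suggests why a linear head may be less prone to the overfitting the abstract attributes to large ViT decoders; that connection is an interpretation, not a claim made in the source.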

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
