CV · Jul 22, 2022

Cost Aggregation with 4D Convolutional Swin Transformer for Few-Shot Segmentation

NVIDIA · U of Toronto
arXiv:2207.10866v1 · 191 citations · h-index: 23 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses a key bottleneck in few-shot segmentation and semantic correspondence by improving cost aggregation. The advance is incremental, since it builds on existing transformer and convolutional methods.

The paper addresses the discontinuity that tokenization introduces into correlation maps in few-shot segmentation. It proposes Volumetric Aggregation with Transformers (VAT), a cost aggregation network built around a 4D Convolutional Swin Transformer that supplies local context and convolutional inductive bias, and it reports state-of-the-art performance on the standard few-shot segmentation benchmarks.

This paper presents a novel cost aggregation network, called Volumetric Aggregation with Transformers (VAT), for few-shot segmentation. The use of transformers can benefit correlation map aggregation through self-attention over a global receptive field. However, the tokenization of a correlation map for transformer processing can be detrimental, because the discontinuity at token boundaries reduces the local context available near the token edges and decreases inductive bias. To address this problem, we propose a 4D Convolutional Swin Transformer, where a high-dimensional Swin Transformer is preceded by a series of small-kernel convolutions that impart local context to all pixels and introduce convolutional inductive bias. We additionally boost aggregation performance by applying transformers within a pyramidal structure, where aggregation at a coarser level guides aggregation at a finer level. Noise in the transformer output is then filtered in the subsequent decoder with the help of the query's appearance embedding. With this model, a new state-of-the-art is set for all the standard benchmarks in few-shot segmentation. It is shown that VAT attains state-of-the-art performance for semantic correspondence as well, where cost aggregation also plays a central role.
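The token-boundary argument in the abstract can be illustrated with a toy 1D sketch. This is not the paper's 4D implementation: the function names, the signal, the token size of 4, and the averaging kernel are all assumptions chosen for illustration. The first function counts how many immediate neighbours each position shares a token with under non-overlapping tokenization; the second applies a small-kernel convolution, which mixes information across the boundary before any tokens are formed.

```python
def same_token_neighbours(length, token_size, radius=1):
    """For each position of a 1D signal, count the neighbours within
    `radius` that fall inside the same non-overlapping token."""
    counts = []
    for i in range(length):
        token = i // token_size
        n = 0
        for j in range(i - radius, i + radius + 1):
            if j != i and 0 <= j < length and j // token_size == token:
                n += 1
        counts.append(n)
    return counts


def conv1d_same(signal, kernel):
    """Zero-padded 1D cross-correlation with an odd-length kernel;
    the output has the same length as the input."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out


# Hypothetical toy signal split into two tokens: [0..3] and [4..7].
signal = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

# Positions 3 and 4 sit on the token boundary and each keep only one
# same-token neighbour, while interior positions keep two.
counts = same_token_neighbours(len(signal), token_size=4)

# A 3-tap averaging kernel applied before tokenization lets positions
# 3 and 4 each receive a contribution from the other side of the boundary.
smoothed = conv1d_same(signal, [1 / 3, 1 / 3, 1 / 3])

print(counts)
print(smoothed[3], smoothed[4])
```

In this toy setting the convolution plays the role the abstract assigns to the small-kernel convolutions that precede the Swin Transformer: every position receives local context regardless of where token boundaries later fall.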

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
