CV · AI · May 12

Vision Transformer-Conditioned UNet for Domain-Adaptive Semantic Segmentation

arXiv:2605.16393 · 47.6
Predicted impact: top 72% in CV · last 90 days
Originality: Incremental advance
AI Analysis

Addresses the performance gap of Vision Transformers in biomedical segmentation by combining global priors with local inductive bias, enabling efficient cross-domain adaptation.

ViTC-UNet conditions a UNet on frozen pre-trained ViT representations via learnable tokens and two-way attention, outperforming baselines in semantic segmentation across MRI and CT modalities.

Semantic segmentation is essential for analysing anatomical features in biomedical research, yet a performance gap remains for Vision Transformers (ViTs) in the field, particularly for sparse, fine-structured, and low signal-to-noise targets. We attribute this challenge in part to the lightweight pixel decoders commonly used in promptable ViT models, which may lack the local inductive bias needed for high-precision biomedical masks. We bridge this gap by introducing ViTC-UNet, which conditions a UNet on frozen pre-trained ViT representations through learnable tokens and a two-way attention decoder. This combines ViT global visual priors with the local inductive bias and high-resolution decoding capacity of UNets, while avoiding end-to-end ViT fine-tuning even in cross-domain settings. ViTC-UNet outperforms baselines on semantic segmentation tasks across MRI and CT modalities, demonstrating that structure-conditioned UNet decoding can efficiently adapt large-scale visual priors to high-complexity biomedical segmentation.
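
The abstract names the mechanism but gives no implementation detail, so the following is only a rough NumPy sketch of one plausible reading: learnable tokens exchange information with frozen ViT patch features in a two-way attention step, and the updated tokens then condition a flattened UNet feature map. Everything here is an assumption made for illustration, including the function names (`cross_attention`, `two_way_block`, `condition_unet_features`), the shapes, the single-head attention without learned projections, and the residual form of the conditioning. It is not the paper's architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # Single-head scaled dot-product attention.
    # Assumption: no learned Q/K/V projections, to keep the sketch short.
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (n_q, n_kv)
    return softmax(scores, axis=-1) @ keys_values   # (n_q, d)

def two_way_block(tokens, vit_feats):
    # Direction 1: tokens read from the ViT patch features.
    tokens = tokens + cross_attention(tokens, vit_feats)
    # Direction 2: patch features read from the updated tokens.
    vit_feats = vit_feats + cross_attention(vit_feats, tokens)
    return tokens, vit_feats

def condition_unet_features(unet_feats, tokens):
    # Hypothetical conditioning step: each UNet location attends to the tokens
    # and the result is added residually to the UNet features.
    return unet_feats + cross_attention(unet_feats, tokens)

rng = np.random.default_rng(0)
d, n_patches, n_tokens, hw = 16, 64, 4, 256

vit_feats = rng.standard_normal((n_patches, d))  # stand-in for frozen ViT output
tokens = rng.standard_normal((n_tokens, d))      # stand-in for learnable tokens
unet_feats = rng.standard_normal((hw, d))        # flattened UNet feature map

tokens_out, vit_out = two_way_block(tokens, vit_feats)
conditioned = condition_unet_features(unet_feats, tokens_out)
print(tokens_out.shape, vit_out.shape, conditioned.shape)
```

In a trainable version of this sketch, only the tokens, the attention projections, and the UNet would receive gradients, while the ViT output would be treated as a fixed input. That matches the abstract's claim of avoiding end-to-end ViT fine-tuning, though how the paper realises it is not stated here.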

