CV | Mar 12, 2025

Isolated Channel Vision Transformers: From Single-Channel Pretraining to Multi-Channel Finetuning

Wenyi Lian, Patrick Micke, Joakim Lindblad, Nataša Sladoje

arXiv:2503.09826v213.15 citations | h-index: 7 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses a practical problem in medical and remote sensing applications by enabling more effective processing of multimodal, multi-channel data. The advance is incremental because it builds on existing ViT frameworks.

The paper tackles the challenge of applying Vision Transformers to multi-channel imaging data by introducing Isolated Channel ViT, which is pretrained on single channels and finetuned on multi-channel datasets. It reports an improvement of 4-14 percentage points over existing channel-adaptive methods on tasks such as cell microscopy and satellite imaging.

Vision Transformers (ViTs) have achieved remarkable success in standard RGB image processing tasks. However, applying ViTs to multi-channel imaging (MCI) data, e.g., for medical and remote sensing applications, remains a challenge. In particular, MCI data often consist of layers acquired from different modalities. Directly training ViTs on such data can obscure complementary information and impair the performance. In this paper, we introduce a simple yet effective pretraining framework for large-scale MCI datasets. Our method, named Isolated Channel ViT (IC-ViT), patchifies image channels individually and thereby enables pretraining for multimodal multi-channel tasks. We show that this channel-wise patchifying is a key technique for MCI processing. More importantly, one can pretrain the IC-ViT on single channels and finetune it on downstream multi-channel datasets. This pretraining framework captures dependencies between patches as well as channels and produces robust feature representation. Experiments on various tasks and benchmarks, including JUMP-CP and CHAMMI for cell microscopy imaging, and So2Sat-LCZ42 for satellite imaging, show that the proposed IC-ViT delivers 4-14 percentage points of performance improvement over existing channel-adaptive approaches. Further, its efficient training makes it a suitable candidate for large-scale pretraining of foundation models on heterogeneous data. Our code is available at https://github.com/shermanlian/IC-ViT.
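The channel-wise patchifying described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see their repository for that). The function name `patchify_channels`, the `(C, H, W)` layout, and the 8-pixel patch size are assumptions chosen for illustration. The sketch only shows the idea that every token comes from exactly one channel, so the same routine accepts a single-channel image and a multi-channel image and differs only in how many tokens it returns.

```python
import numpy as np


def patchify_channels(image: np.ndarray, patch: int) -> np.ndarray:
    """Split a (C, H, W) image into per-channel patch tokens.

    Returns an array of shape (C * N, patch * patch), where N is the number
    of patches per channel. Each token holds pixels from one channel only.
    Hypothetical helper for illustration, not the IC-ViT code.
    """
    c, h, w = image.shape
    if h % patch or w % patch:
        raise ValueError("H and W must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    # (C, gh, patch, gw, patch) -> (C, gh, gw, patch, patch)
    x = image.reshape(c, gh, patch, gw, patch).transpose(0, 1, 3, 2, 4)
    # Flatten so tokens are ordered channel by channel, then row-major by patch.
    return x.reshape(c * gh * gw, patch * patch)


rng = np.random.default_rng(0)
single = rng.random((1, 32, 32))  # one channel, as in single-channel pretraining
multi = rng.random((5, 32, 32))   # five channels, as in multi-channel finetuning

tokens_single = patchify_channels(single, 8)
tokens_multi = patchify_channels(multi, 8)
print(tokens_single.shape, tokens_multi.shape)
```

Because the token dimension (`patch * patch`) does not depend on the number of channels, a model that consumes such tokens can in principle be pretrained on single-channel inputs and later fed longer token sequences from multi-channel inputs. How IC-ViT embeds these tokens and encodes channel identity is not specified in the abstract and is left out of this sketch.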
