ZACH-ViT: Regime-Dependent Inductive Bias in Compact Vision Transformers for Medical Imaging

arXiv:2602.17929v13 citationsHas Code

Originality Incremental advance

AI Analysis

This work addresses the need for efficient and adaptable models in resource-constrained clinical environments, though it is incremental as it modifies existing Vision Transformer architectures for specific data regimes.

The paper tackled the problem of fixed spatial priors in Vision Transformers hindering generalization in medical imaging by introducing ZACH-ViT, a compact model that removes positional embeddings and the [CLS] token, achieving competitive performance with 0.25M parameters across seven MedMNIST datasets under few-shot conditions.

Vision Transformers rely on positional embeddings and class tokens that encode fixed spatial priors. While effective for natural images, these priors may hinder generalization when spatial layout is weakly informative or inconsistent, a frequent condition in medical imaging and edge-deployed clinical systems. We introduce ZACH-ViT (Zero-token Adaptive Compact Hierarchical Vision Transformer), a compact Vision Transformer that removes both positional embeddings and the [CLS] token, achieving permutation invariance through global average pooling over patch representations. The term "Zero-token" specifically refers to removing the dedicated [CLS] aggregation token and positional embeddings; patch tokens remain unchanged and are processed normally. Adaptive residual projections preserve training stability in compact configurations while maintaining a strict parameter budget. Evaluation is performed across seven MedMNIST datasets spanning binary and multi-class tasks under a strict few-shot protocol (50 samples per class, fixed hyperparameters, five random seeds). The empirical analysis demonstrates regime-dependent behavior: ZACH-ViT (0.25M parameters, trained from scratch) achieves its strongest advantage on BloodMNIST and remains competitive with TransMIL on PathMNIST, while its relative advantage decreases on datasets with strong anatomical priors (OCTMNIST, OrganAMNIST), consistent with the architectural hypothesis. These findings support the view that aligning architectural inductive bias with data structure can be more important than pursuing universal benchmark dominance. Despite its minimal size and lack of pretraining, ZACH-ViT achieves competitive performance while maintaining sub-second inference times, supporting deployment in resource-constrained clinical environments. Code and models are available at https://github.com/Bluesman79/ZACH-ViT.

View on arXiv PDF Code

Similar