CV | Apr 8, 2024

HSViT: Horizontally Scalable Vision Transformer

arXiv:2404.05196v2 | 8 citations | h-index: 9 | Has Code | IJCNN
Originality: Highly original
AI Analysis

The paper addresses the problem of making Vision Transformers more efficient and accessible for devices with limited computing resources. It proposes a novel method rather than an incremental improvement.

The paper tackles two obstacles for Vision Transformers, the need for large-scale pre-training and the high computational cost, by introducing HSViT. Without pre-training, HSViT achieves up to 10% higher top-1 accuracy than state-of-the-art schemes on small datasets, and it gives existing CNN backbones up to a 3.1% improvement in top-1 accuracy on ImageNet.

Because it lacks prior knowledge (inductive bias), the Vision Transformer (ViT) requires pre-training on large-scale datasets to perform well. Moreover, the growing number of layers and parameters in ViT models impedes their applicability to devices with limited computing resources. To address these challenges, this paper introduces a novel horizontally scalable vision transformer (HSViT) scheme. Specifically, a novel image-level feature embedding is introduced to ViT, where the preserved inductive bias removes the need for pre-training while still outperforming existing schemes on small datasets. In addition, a novel horizontally scalable architecture is designed, facilitating collaborative model training and inference across multiple computing devices. The experimental results show that, without pre-training, HSViT achieves up to 10% higher top-1 accuracy than state-of-the-art schemes on small datasets, while providing existing CNN backbones up to 3.1% improvement in top-1 accuracy on ImageNet. The code is available at https://github.com/xuchenhao001/HSViT.
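The abstract does not spell out how work is divided across devices, so the following is only a minimal sketch of one way a horizontally scaled model could combine results. It assumes, purely for illustration, that convolutional feature maps are partitioned into groups, that each group is scored independently (for example on a separate device), and that the per-group class scores are averaged. The function names `split_feature_maps`, `average_logits`, and `predict` are hypothetical and are not taken from the HSViT repository.

```python
from typing import List, Sequence


def split_feature_maps(num_maps: int, num_groups: int) -> List[List[int]]:
    """Partition feature-map indices 0..num_maps-1 into contiguous, near-equal groups.

    Assumption for illustration: each group could be handled by a separate device.
    """
    if num_groups < 1 or num_groups > num_maps:
        raise ValueError("need 1 <= num_groups <= num_maps")
    base, extra = divmod(num_maps, num_groups)
    groups: List[List[int]] = []
    start = 0
    for g in range(num_groups):
        # The first `extra` groups take one additional map so sizes differ by at most 1.
        size = base + (1 if g < extra else 0)
        groups.append(list(range(start, start + size)))
        start += size
    return groups


def average_logits(per_group_logits: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of the class scores produced by each group."""
    n = len(per_group_logits)
    return [sum(column) / n for column in zip(*per_group_logits)]


def predict(per_group_logits: Sequence[Sequence[float]]) -> int:
    """Return the index of the class with the highest averaged score."""
    avg = average_logits(per_group_logits)
    return max(range(len(avg)), key=avg.__getitem__)


if __name__ == "__main__":
    # Ten hypothetical feature maps shared among three groups.
    print(split_feature_maps(10, 3))
    # Made-up scores from three groups over two classes.
    print(predict([[1.0, 3.0], [3.0, 1.0], [2.0, 5.0]]))
```

Averaging is chosen here only because it is the simplest way to merge independent predictions; the paper's actual aggregation rule may differ.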

Code Implementations: 1 repo
