CVAIApr 7, 2024

GvT: A Graph-based Vision Transformer with Talking-Heads Utilizing Sparsity, Trained from Scratch on Small Datasets

arXiv:2404.04924v21 citationsh-index: 2
Originality Incremental advance
AI Analysis

This addresses the problem of limited data availability for training vision models, offering a solution for researchers and practitioners working with small datasets, though it is incremental as it builds on existing transformer and graph-based methods.

The paper tackles the performance gap between Vision Transformers and CNvolutional Neural Networks when trained from scratch on small datasets by proposing a Graph-based Vision Transformer (GvT) that uses graph convolutional projection, graph-pooling, and talking-heads technology, achieving comparable or superior results to deep CNNs and outperforming vision transformers without pre-training.

Vision Transformers (ViTs) have achieved impressive results in large-scale image classification. However, when training from scratch on small datasets, there is still a significant performance gap between ViTs and Convolutional Neural Networks (CNNs), which is attributed to the lack of inductive bias. To address this issue, we propose a Graph-based Vision Transformer (GvT) that utilizes graph convolutional projection and graph-pooling. In each block, queries and keys are calculated through graph convolutional projection based on the spatial adjacency matrix, while dot-product attention is used in another graph convolution to generate values. When using more attention heads, the queries and keys become lower-dimensional, making their dot product an uninformative matching function. To overcome this low-rank bottleneck in attention heads, we employ talking-heads technology based on bilinear pooled features and sparse selection of attention tensors. This allows interaction among filtered attention scores and enables each attention mechanism to depend on all queries and keys. Additionally, we apply graph-pooling between two intermediate blocks to reduce the number of tokens and aggregate semantic information more effectively. Our experimental results show that GvT produces comparable or superior outcomes to deep convolutional networks and surpasses vision transformers without pre-training on large datasets. The code for our proposed model is publicly available on the website.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes