CV LG Oct 13, 2022

Vision Transformers provably learn spatial structure

Samy Jelassi, Michael E. Sander, Yuanzhi Li

arXiv:2210.09221v129.2111 citations h-index: 28

Originality: Incremental advance

AI Analysis

The paper provides theoretical justification for ViTs' empirical success and addresses a foundational gap in understanding how they learn. The advance is incremental because it builds on existing models and datasets.

The paper asks how Vision Transformers (ViTs) learn spatially localized patterns without built-in visual inductive biases. It proposes a simplified ViT model with positional attention and proves that the model implicitly learns spatial structure, which enables sample-efficient transfer to downstream datasets that share that structure.

Vision Transformers (ViTs) have achieved performance comparable or superior to that of Convolutional Neural Networks (CNNs) in computer vision. This empirical breakthrough is even more remarkable since, in contrast to CNNs, ViTs do not embed any visual inductive bias of spatial locality. Yet, recent works have shown that while minimizing their training loss, ViTs specifically learn spatially localized patterns. This raises a central question: how do ViTs learn these patterns by solely minimizing their training loss using gradient-based methods from random initialization? In this paper, we provide some theoretical justification of this phenomenon. We propose a spatially structured dataset and a simplified ViT model. In this model, the attention matrix solely depends on the positional encodings. We call this mechanism the positional attention mechanism. On the theoretical side, we consider a binary classification task and show that while the learning problem admits multiple solutions that generalize, our model implicitly learns the spatial structure of the dataset while generalizing: we call this phenomenon patch association. We prove that patch association helps to sample-efficiently transfer to downstream datasets that share the same structure as the pre-training one but differ in the features. Lastly, we empirically verify that a ViT with positional attention performs similarly to the original one on CIFAR-10/100, SVHN and ImageNet.
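To make the positional attention idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the scaled dot-product scoring of positional encodings, the absence of learned projections, and every name in it (`positional_attention_matrix`, `positional_attention`, `pos_enc`) are assumptions chosen for illustration. The only property taken from the abstract is that the attention matrix depends on the positional encodings alone, not on the patch contents.

```python
import math
import random


def softmax_rows(scores):
    """Row-wise softmax, subtracting the row max for numerical stability."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def matmul(a, b):
    """Nested-list matrix product: (n x k) @ (k x m) -> (n x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def positional_attention_matrix(pos_enc):
    """Attention weights computed from positional encodings only.

    Assumption for this sketch: scores are <p_i, p_j> / sqrt(d).
    """
    d = len(pos_enc[0])
    transposed = [list(col) for col in zip(*pos_enc)]
    gram = matmul(pos_enc, transposed)
    scores = [[s / math.sqrt(d) for s in row] for row in gram]
    return softmax_rows(scores)


def positional_attention(patches, pos_enc):
    """Mix patch embeddings with weights that ignore patch content."""
    return matmul(positional_attention_matrix(pos_enc), patches)


# Hypothetical toy sizes: 4 patches, 3-dim positions, 2-dim patch embeddings.
rng = random.Random(0)
num_patches, pos_dim, patch_dim = 4, 3, 2
pos_enc = [[rng.gauss(0, 1) for _ in range(pos_dim)] for _ in range(num_patches)]
image_a = [[rng.gauss(0, 1) for _ in range(patch_dim)] for _ in range(num_patches)]
image_b = [[rng.gauss(0, 1) for _ in range(patch_dim)] for _ in range(num_patches)]

# The same attention matrix is applied to both images.
attn = positional_attention_matrix(pos_enc)
out_a = positional_attention(image_a, pos_enc)
out_b = positional_attention(image_b, pos_enc)
```

Because `attn` is built without reading `image_a` or `image_b`, any structure it encodes is a property of patch positions. That is the kind of input-independent pattern the abstract refers to as patch association, and it is why such a matrix could plausibly be reused on a downstream dataset whose features differ but whose spatial layout is the same.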
