Do We Need Reformer for Vision? An Experimental Comparison with Vision Transformers
This work addresses the computational cost of Vision Transformers on high-resolution images, a problem for researchers and practitioners in resource-constrained settings. The contribution is incremental: it compares existing methods rather than introducing a new one.
The paper investigates whether the Reformer architecture, which uses locality-sensitive hashing (LSH) attention to reduce theoretical complexity, can serve as an efficient alternative to Vision Transformers (ViTs) for computer vision tasks. The Reformer achieved higher accuracy on CIFAR-10, but ViTs were more efficient in practice and faster end to end on ImageNet-100 and on high-resolution medical images, indicating that the theoretical gains require longer token sequences than typical images provide.
Transformers have recently demonstrated strong performance in computer vision, with Vision Transformers (ViTs) leveraging self-attention to capture both low-level and high-level image features. However, standard ViTs remain computationally expensive: global self-attention scales quadratically with the number of tokens, which limits their practicality for high-resolution inputs and resource-constrained settings. In this work, we investigate the Reformer architecture as an alternative vision backbone. By combining patch-based tokenization with locality-sensitive hashing (LSH) attention, our model approximates global self-attention while reducing its theoretical time complexity from $\mathcal{O}(n^2)$ to $\mathcal{O}(n \log n)$ in the sequence length $n$. We evaluate the proposed Reformer-based vision model on CIFAR-10 to assess its behavior on small-scale datasets, on ImageNet-100 to study its accuracy--efficiency trade-off in a more realistic setting, and on a high-resolution medical imaging dataset to evaluate the model under longer token sequences. The Reformer achieves higher accuracy than our ViT-style baseline on CIFAR-10, but in the larger and higher-resolution settings the ViT model is consistently more efficient in practice and faster in end-to-end computation time. These results suggest that, despite the theoretical advantages of LSH-based attention, meaningful computational gains require sequence lengths substantially longer than those produced by typical high-resolution images.
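To make the sequence-length argument concrete, the following minimal sketch counts the tokens produced by patch-based tokenization and computes the asymptotic ratio $n^2 / (n \log_2 n) = n / \log_2 n$ between the two attention costs. The image sizes and patch sizes are illustrative assumptions, not the configurations used in the experiments, and the function names `num_tokens` and `attention_cost_ratio` are hypothetical. Constant factors such as hashing, sorting, and multiple hash rounds are ignored, so the ratio is only an upper bound on any achievable speed-up.

```python
import math


def num_tokens(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patch tokens for an image (no class token)."""
    return (height // patch) * (width // patch)


def attention_cost_ratio(n: int) -> float:
    """Asymptotic ratio n^2 / (n * log2 n) = n / log2 n, constants ignored."""
    return n / math.log2(n)


# Illustrative (image side, patch size) pairs; these are assumptions,
# not the settings reported in the paper.
for side, patch in [(32, 4), (224, 16), (1024, 16)]:
    n = num_tokens(side, side, patch)
    ratio = attention_cost_ratio(n)
    print(f"{side}x{side}, patch {patch}: n = {n}, n / log2(n) = {ratio:.1f}")
```

Under these assumptions, a 224x224 image yields only 196 tokens, while a 1024x1024 image yields 4096. Even at that length the idealized ratio is a few hundred before any overhead is counted, which is consistent with the observation that practical gains require sequences longer than typical images produce.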