Unsupervised Training of Vision Transformers with Synthetic Negatives
This work addresses a neglected aspect of self-supervised learning for vision transformers, but its contribution is incremental, since it builds on existing synthetic negative techniques.
The paper integrates synthetic hard negatives into self-supervised training to improve vision transformer representation learning, and it reports performance improvements for the DeiT-S and Swin-T architectures.
This paper does not introduce a novel method per se. Instead, we address the neglected potential of hard negative samples in self-supervised learning. Previous work has explored synthetic hard negatives, but rarely in the context of vision transformers. We build on this observation and integrate synthetic hard negatives to improve vision transformer representation learning. This simple yet effective technique notably improves the discriminative power of the learned representations. Our experiments show performance improvements for both DeiT-S and Swin-T architectures.
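The text above does not say how the synthetic hard negatives are constructed, so the following is only a minimal sketch of one common approach, under stated assumptions: negatives are synthesized in feature space by convexly mixing pairs of the negatives most similar to the query, and the result is appended to the negative set of an InfoNCE-style contrastive loss. The function names (`synthesize_hard_negatives`, `info_nce`), the parameters (`num_hard`, `num_synth`, `temperature`), and the use of plain NumPy vectors in place of DeiT-S or Swin-T features are all hypothetical and are not taken from the paper.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm along `axis`."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def synthesize_hard_negatives(query, negatives, num_hard=4, num_synth=3, rng=None):
    """Hypothetical helper: mix pairs of the hardest negatives for one query.

    query:     (D,)   unit-norm embedding
    negatives: (N, D) unit-norm embeddings
    returns:   (num_synth, D) unit-norm synthetic negatives
    """
    rng = np.random.default_rng() if rng is None else rng
    k = min(num_hard, len(negatives))
    sims = negatives @ query                # similarity of each negative to the query
    hard = negatives[np.argsort(-sims)[:k]] # the k most similar (hardest) negatives
    i = rng.integers(0, k, size=num_synth)  # random pairs among the hard negatives
    j = rng.integers(0, k, size=num_synth)
    alpha = rng.uniform(0.0, 1.0, size=(num_synth, 1))
    mixed = alpha * hard[i] + (1.0 - alpha) * hard[j]
    return l2_normalize(mixed)


def info_nce(query, positive, negatives, temperature=0.2):
    """InfoNCE loss for one query with one positive and a set of negatives."""
    logits = np.concatenate([[query @ positive], negatives @ query]) / temperature
    logits = logits - logits.max()          # shift for numerical stability
    return -(logits[0] - np.log(np.exp(logits).sum()))
```

In this sketch, a training step would call `synthesize_hard_negatives` for each query and concatenate its output with the real negatives before computing `info_nce`. Because the extra negatives only add terms to the denominator of the softmax, the loss with synthetic negatives is never lower than the loss without them for the same embeddings, which is the sense in which they make the contrastive task harder.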