Fake & Square: Training Self-Supervised Vision Transformers with Synthetic Data and Synthetic Hard Negatives
This work addresses the data and negative sampling bottlenecks in self-supervised learning for computer vision, but it is incremental as it builds on existing methods without introducing a new paradigm.
The paper tackled the problem of reducing reliance on real-world data and curated hard negatives in self-supervised learning for vision transformers by exploring synthetic data and synthetic hard negatives, finding that the Syn2Co framework shows promise but has limitations in enhancing representation robustness and transferability.
This paper does not introduce a new method per se. Instead, we build on existing self-supervised learning approaches for vision, drawing inspiration from the adage "fake it till you make it". While contrastive self-supervised learning has achieved remarkable success, it typically relies on vast amounts of real-world data and carefully curated hard negatives. To explore alternatives to these requirements, we investigate two forms of "faking it" in vision transformers. First, we study the potential of generative models for unsupervised representation learning, leveraging synthetic data to augment sample diversity. Second, we examine the feasibility of generating synthetic hard negatives in the representation space, creating diverse and challenging contrasts. Our framework - dubbed Syn2Co - combines both approaches and evaluates whether synthetically enhanced training can lead to more robust and transferable visual representations on DeiT-S and Swin-T architectures. Our findings highlight the promise and limitations of synthetic data in self-supervised learning, offering insights for future work in this direction.