Morphing Tokens Draw Strong Masked Image Models
The work addresses a key bottleneck in self-supervised learning for computer vision and offers a way to improve representation learning at minimal extra cost, though it appears incremental because it builds on existing MIM frameworks.
The paper tackles spatially inconsistent supervision in masked image modeling (MIM) for Vision Transformers. It introduces Dynamic Token Morphing (DTM) to generate contextualized targets, which improves training trends and surpasses state-of-the-art MIM methods on benchmarks such as ImageNet-1K and ADE20K.
Masked image modeling (MIM) has emerged as a promising approach for pre-training Vision Transformers (ViTs). MIM methods predict masked tokens token-wise to recover target signals that are either tokenized from images or generated by pre-trained models such as vision-language models. While using tokenizers or pre-trained models is viable, they often provide spatially inconsistent supervision even for neighboring tokens, which hinders models from learning discriminative representations. Our pilot study identifies spatial inconsistency in supervisory signals and suggests that addressing it can improve representation learning. Building on this insight, we introduce Dynamic Token Morphing (DTM), a novel method that dynamically aggregates tokens while preserving context to generate contextualized targets, thereby reducing spatial inconsistency. DTM is compatible with various SSL frameworks; we showcase significantly improved MIM results with barely any extra training cost. Our method facilitates MIM training by using more spatially consistent targets, resulting in improved training trends as evidenced by lower losses. Experiments on ImageNet-1K and ADE20K demonstrate the superiority of DTM, which surpasses complex state-of-the-art MIM methods. Furthermore, transfer-learning evaluation on downstream datasets such as iNaturalist, along with extensive empirical studies, supports the effectiveness of DTM.
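To make the idea of aggregating tokens into contextualized targets more concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a simple greedy grouping by cosine similarity: each target token joins the most similar existing group if the similarity clears a threshold, and every token is then replaced by its group mean. The function name `morph_tokens`, the `threshold` parameter, and the greedy grouping rule are all hypothetical choices made for illustration; the abstract does not specify how DTM selects or aggregates tokens.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def morph_tokens(tokens, threshold=0.9):
    """Illustrative token aggregation (an assumption, not the paper's method).

    Greedily groups similar tokens, then replaces each token with the mean
    of its group. The output has one vector per input token, so the token
    layout is kept while similar tokens share a single target.
    """
    groups = []     # each entry: list of token indices in that group
    centroids = []  # running mean vector of each group

    for i, tok in enumerate(tokens):
        best_group, best_sim = None, threshold
        for g, centroid in enumerate(centroids):
            sim = cosine(tok, centroid)
            if sim >= best_sim:
                best_group, best_sim = g, sim

        if best_group is None:
            # No sufficiently similar group: start a new one.
            groups.append([i])
            centroids.append(list(tok))
        else:
            # Join the group and update its running mean.
            groups[best_group].append(i)
            n = len(groups[best_group])
            centroids[best_group] = [
                (c * (n - 1) + t) / n
                for c, t in zip(centroids[best_group], tok)
            ]

    # Every token receives its group's mean as the aggregated target.
    morphed = [None] * len(tokens)
    for members, centroid in zip(groups, centroids):
        for i in members:
            morphed[i] = list(centroid)
    return morphed, groups


if __name__ == "__main__":
    # Toy 2-D "target tokens": the first two are nearly parallel.
    toy_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    morphed, groups = morph_tokens(toy_tokens, threshold=0.9)
    print(groups)
    print(morphed)
```

In this toy setting, tokens that point in nearly the same direction end up with an identical target, which is one simple way to picture "more spatially consistent" supervision. A real implementation would operate on batched tensors and would follow the grouping rule described in the paper rather than this greedy heuristic.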