WAVECLIP: Wavelet Tokenization for Adaptive-Resolution CLIP
WAVECLIP addresses the need for efficient, adaptive vision-language models. The contribution is incremental in that it builds on CLIP, but the wavelet-based tokenization method it adds is novel.
The authors tackle adaptive-resolution inference in CLIP by introducing WAVECLIP, which uses wavelet tokenization to process images from coarse to fine. This enables a dynamic compute-accuracy trade-off and achieves competitive accuracy with significant computational savings.
We introduce WAVECLIP, a single unified model for adaptive-resolution inference in CLIP, enabled by wavelet-based tokenization. WAVECLIP replaces standard patch embeddings with a multi-level wavelet decomposition, enabling the model to process images from coarse to fine while naturally supporting multiple resolutions within the same model. At inference time, the model begins with low-resolution tokens and refines only when needed, using key-value caching and causal cross-level attention to reuse computation, so that each refinement step introduces only new information to the model. We evaluate WAVECLIP in zero-shot classification, demonstrating that a simple confidence-based gating mechanism enables adaptive early exits. This allows users to dynamically choose a compute-accuracy trade-off using a single deployed model. Our approach requires only lightweight distillation from a frozen CLIP teacher and achieves competitive accuracy with significant computational savings.
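The coarse-to-fine idea can be illustrated with a minimal sketch. It is not the WAVECLIP implementation: the abstract does not name the wavelet family or the confidence measure, so the Haar wavelet, the maximum softmax probability as confidence, and every function name here (`haar_level`, `wavelet_pyramid`, `adaptive_predict`) are assumptions made for illustration. The transformer, key-value caching, and cross-level attention are omitted; per-level logits are simply taken as given.

```python
import math


def haar_level(img):
    """One level of an orthonormal 2D Haar transform (assumed wavelet).

    `img` is a 2D list with even height and width. Returns the low-pass
    band and a tuple of three detail bands, each half the input size.
    """
    ll, d_cols, d_rows, d_diag = [], [], [], []
    for i in range(0, len(img), 2):
        r_ll, r_c, r_r, r_d = [], [], [], []
        for j in range(0, len(img[0]), 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            r_ll.append((a + b + c + d) / 2)  # local average (coarse content)
            r_c.append((a - b + c - d) / 2)   # difference across columns
            r_r.append((a + b - c - d) / 2)   # difference across rows
            r_d.append((a - b - c + d) / 2)   # diagonal difference
        ll.append(r_ll)
        d_cols.append(r_c)
        d_rows.append(r_r)
        d_diag.append(r_d)
    return ll, (d_cols, d_rows, d_diag)


def wavelet_pyramid(img, levels):
    """Multi-level decomposition; height and width must be divisible by 2**levels.

    Returns the coarsest low-pass band and the detail bands ordered from
    coarse to fine, i.e. the order in which a model would consume them.
    """
    details = []
    ll = img
    for _ in range(levels):
        ll, bands = haar_level(ll)
        details.append(bands)
    details.reverse()  # coarsest details first
    return ll, details


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_predict(logits_per_level, threshold=0.9):
    """Confidence-gated early exit over resolution levels.

    `logits_per_level` is an iterable (ideally a lazy generator) yielding
    class logits from the coarsest level to the finest. Iteration stops at
    the first level whose top softmax probability reaches `threshold`
    (an assumed confidence measure). Returns (predicted_class, level_used).
    """
    pred, level = None, -1
    for level, logits in enumerate(logits_per_level):
        probs = softmax(logits)
        conf = max(probs)
        pred = probs.index(conf)
        if conf >= threshold:
            break  # confident enough: skip all finer levels
    if pred is None:
        raise ValueError("need at least one resolution level")
    return pred, level
```

In this sketch, raising `threshold` pushes more inputs to finer levels (more compute, typically higher accuracy), while lowering it exits earlier. Passing a generator for `logits_per_level` means finer levels are never evaluated once an exit fires, which is the source of the savings in this toy setting.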