CV · AI · LG · May 29, 2025

Implicit Inversion turns CLIP into a Decoder

arXiv:2505.23161v21 citations · h-index: 18
Originality: Incremental advance
AI Analysis

This work reveals untapped generative potential in discriminative models like CLIP, enabling new applications without model modifications, though it is incremental in leveraging existing CLIP capabilities.

The authors tackled the problem of generating images using CLIP without any additional decoder or training, achieving text-to-image generation, style transfer, and image reconstruction solely through optimization of an implicit neural representation.

CLIP is a discriminative model trained to align images and text in a shared embedding space. Due to its multimodal structure, it serves as the backbone of many generative pipelines, where a decoder is trained to map from the shared space back to images. In this work, we show that image synthesis is nevertheless possible using CLIP alone -- without any decoder, training, or fine-tuning. Our approach optimizes a frequency-aware implicit neural representation that encourages coarse-to-fine generation by stratifying frequencies across network layers. To stabilize this inverse mapping, we introduce adversarially robust initialization, a lightweight Orthogonal Procrustes projection to align local text and image embeddings, and a blending loss that anchors outputs to natural image statistics. Without altering CLIP's weights, this framework unlocks capabilities such as text-to-image generation, style transfer, and image reconstruction. These findings suggest that discriminative models may hold untapped generative potential, hidden in plain sight.
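The Orthogonal Procrustes projection mentioned in the abstract has a standard closed-form solution, sketched below with NumPy. This is a minimal illustration of the generic technique, not the authors' implementation: the abstract does not say how the "local" text and image embeddings are selected, so the arrays `text_emb` and `image_emb`, their shapes, and the helper name `orthogonal_procrustes` are assumptions, and synthetic data stands in for real CLIP embeddings.

```python
import numpy as np


def orthogonal_procrustes(source, target):
    """Return the orthogonal matrix R minimizing ||source @ R - target||_F.

    Hypothetical helper: `source` and `target` are (n, d) arrays of paired
    embeddings (for example, text and image embeddings of matching pairs).
    """
    # Cross-covariance between the two embedding sets, shape (d, d).
    cross = source.T @ target
    # The SVD of the cross-covariance gives the optimal orthogonal map U @ Vt.
    u, _, vt = np.linalg.svd(cross)
    return u @ vt


# Synthetic stand-in data (assumption): image embeddings are an exact
# rotation of the text embeddings, so the recovered map should fit them.
rng = np.random.default_rng(0)
dim = 8
text_emb = rng.normal(size=(32, dim))
true_rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
image_emb = text_emb @ true_rotation

rotation = orthogonal_procrustes(text_emb, image_emb)
residual = np.linalg.norm(text_emb @ rotation - image_emb)
```

Because the solution is a single SVD of a d-by-d matrix, the alignment adds no trainable parameters and leaves CLIP's weights untouched, which is consistent with the abstract's description of it as lightweight.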

Code Implementations: 1 repo
