IV · CV · May 11, 2025

Whitened CLIP as a Likelihood Surrogate of Images and Captions

arXiv:2505.06934v1 · 12 citations · h-index: 4 · Has Code · ICML
Originality: Synthesis-oriented
AI Analysis

This work provides a fast, training-free method for likelihood estimation in vision-language tasks, but it is incremental, as it builds on existing CLIP models.

The paper tackled the problem of approximating likelihoods for images and captions by introducing Whitened CLIP, a training-free transformation of CLIP embeddings, which resulted in a simple log-likelihood estimate using Euclidean distance in a whitened space.

Likelihood approximations for images are not trivial to compute and can be useful in many applications. We examine the use of Contrastive Language-Image Pre-training (CLIP) to assess the likelihood of images and captions. We introduce Whitened CLIP, a novel transformation of the CLIP latent space via an invertible linear operation. This transformation ensures that each feature in the embedding space has zero mean, unit standard deviation, and no correlation with any other feature, resulting in an identity covariance matrix. We show that the statistics of the whitened embeddings can be well approximated by a standard normal distribution; thus, the log-likelihood is estimated simply by the squared Euclidean norm in the whitened embedding space. The whitening procedure is completely training-free and is performed using a pre-computed whitening matrix; hence, it is very fast. We present several preliminary experiments demonstrating the properties and applicability of these likelihood scores to images and captions.
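The whitening step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say which whitening matrix is used, so PCA whitening is assumed here, and random correlated vectors stand in for real CLIP embeddings. The function names `fit_whitening` and `log_likelihood` and the `eps` parameter are hypothetical. Under a standard normal model in d dimensions, the log-likelihood of a whitened vector z is -(d log(2π) + ||z||²) / 2, so it depends on the data only through the squared Euclidean norm.

```python
import numpy as np

def fit_whitening(embeddings, eps=1e-8):
    """Estimate the mean and a whitening matrix W from an (n, d) array.

    Assumption: PCA whitening, W = Lambda^{-1/2} V^T, where V and Lambda
    are the eigenvectors and eigenvalues of the sample covariance.
    """
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    cov = centered.T @ centered / (len(embeddings) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # cov is symmetric
    # Scale each eigenvector (column) by 1/sqrt(eigenvalue), then transpose.
    # eps guards against division by a near-zero eigenvalue.
    W = (eigvecs / np.sqrt(eigvals + eps)).T
    return mean, W

def log_likelihood(x, mean, W):
    """Standard-normal log-likelihood of x after whitening.

    x may be a single (d,) vector or an (n, d) batch.
    """
    z = (x - mean) @ W.T
    d = z.shape[-1]
    return -0.5 * (d * np.log(2 * np.pi) + (z ** 2).sum(axis=-1))

# Stand-in data: correlated, shifted Gaussian vectors instead of CLIP embeddings.
rng = np.random.default_rng(0)
mixing = np.eye(8) + 0.3 * rng.normal(size=(8, 8))
reference = rng.normal(size=(5000, 8)) @ mixing + 2.0

# The whitening matrix is computed once and then reused for every query.
mean, W = fit_whitening(reference)
scores = log_likelihood(reference, mean, W)
```

Because `mean` and `W` are pre-computed, scoring a new embedding costs one matrix-vector product and one norm, which is consistent with the abstract's claim that the procedure is fast and needs no training.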

Code Implementations: 1 repo
