CVJun 2, 2025

Unconditional CNN denoisers contain sparse semantic representation of images

arXiv:2506.01912v21 citationsh-index: 9
Originality Incremental advance
AI Analysis

This work provides insights into unsupervised semantic learning in generative models, which could benefit researchers in computer vision and AI, though it is incremental in exploring model interpretability.

The study investigated the internal mechanisms of unconditional CNN denoisers in diffusion models, revealing that the middle block of a UNet decomposes images into sparse channel representations that capture semantic similarity without supervision, and developed a self-guided algorithm for stochastic image reconstruction based on this representation.

Generative diffusion models learn probability densities over diverse image datasets by estimating the score with a neural network trained to remove noise. Despite their remarkable success in generating high-quality images, the internal mechanisms of the underlying score networks are not well understood. Here, we examine the image representation that arises from score estimation in a {fully-convolutional unconditional UNet}. We show that the middle block of the UNet decomposes individual images into sparse subsets of active channels, and that the vector of spatial averages of these channels can provide a nonlinear representation of the underlying clean images. Euclidean distances in this representation space are semantically meaningful, even though no conditioning information is provided during training. We develop a novel algorithm for stochastic reconstruction of images conditioned on this representation: The synthesis using the unconditional model is "self-guided" by the representation extracted from that very same model. For a given representation, the common patterns in the set of reconstructed samples reveal the features captured in the middle block of the UNet. Together, these results show, for the first time, that a measure of semantic similarity emerges, unsupervised, solely from the denoising objective.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes