CV AI LG May 23, 2024

Conditional Diffusion on Web-Scale Image Pairs leads to Diverse Image Variations

Manoj Kumar, Neil Houlsby, Emiel Hoogeboom

arXiv:2405.14857v32.0 h-index: 52

Originality: Incremental advance

AI Analysis

This paper addresses the need for more effective image variation techniques in computer vision, though it appears incremental because it builds on existing diffusion model approaches.

The paper tackles the problem of generating diverse image variations while preserving semantic context. It introduces Semantica, a diffusion model trained on web-scale image pairs that, once trained, adaptively generates new images from a dataset by using images from that dataset as input.

Generating image variations, where a model produces variations of an input image while preserving the semantic context, has gained increasing attention. Current image variation techniques involve adapting a text-to-image model to reconstruct an input image conditioned on the same image. We first demonstrate that a diffusion model trained to reconstruct an input image from frozen embeddings can reconstruct the image with only minor variations. Second, inspired by how text-to-image models learn from web-scale text-image pairs, we explore a new pretraining strategy to generate image variations using a large collection of image pairs. Our diffusion model Semantica receives a random (encoded) image from a webpage as conditional input and denoises another noisy random image from the same webpage. We carefully examine various design choices for the image encoder, given its crucial role in extracting relevant context from the input image. Once trained, Semantica can adaptively generate new images from a dataset by simply using images from that dataset as input. Finally, we identify limitations in standard image consistency metrics for evaluating image variations and propose alternative metrics based on few-shot generation.
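
The pairing scheme described in the abstract (condition on one encoded image from a webpage, denoise a different image from the same webpage) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the names `sample_pair`, `add_noise`, and `training_example` are hypothetical, the cosine noise schedule is a common choice rather than the authors' stated one, images are flat lists of floats, and `encode` stands in for whatever frozen image encoder is used.

```python
import math
import random


def sample_pair(pages, rng):
    """Pick a webpage with at least two images, then two distinct images from it.

    `pages` maps a webpage identifier to a list of image identifiers
    (a hypothetical data layout, not the paper's).
    """
    eligible = [imgs for imgs in pages.values() if len(imgs) >= 2]
    images = rng.choice(eligible)
    cond_id, target_id = rng.sample(images, 2)  # two distinct images
    return cond_id, target_id


def add_noise(x, t, rng):
    """Noise a flat image x at time t in [0, 1].

    Assumed variance-preserving cosine schedule:
    z_t = cos(t * pi / 2) * x + sin(t * pi / 2) * eps.
    """
    alpha = math.cos(t * math.pi / 2)
    sigma = math.sin(t * math.pi / 2)
    eps = [rng.gauss(0.0, 1.0) for _ in x]
    z_t = [alpha * xi + sigma * ei for xi, ei in zip(x, eps)]
    return z_t, eps


def training_example(pages, pixels, encode, rng):
    """Build one training example: encoded conditioning image plus noisy target."""
    cond_id, target_id = sample_pair(pages, rng)
    t = rng.random()
    z_t, eps = add_noise(pixels[target_id], t, rng)
    # A denoiser would take (z_t, t, cond) as input and be trained
    # to recover eps (or the clean target image).
    return {"cond": encode(pixels[cond_id]), "z_t": z_t, "t": t, "eps": eps}
```

In this sketch the conditioning image and the target are always different images, so the denoiser cannot simply copy its input; the conditioning signal can only help through whatever content the two images from the same webpage share.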
