CV | AI | LG | May 23, 2024

Conditional Diffusion on Web-Scale Image Pairs leads to Diverse Image Variations

arXiv:2405.14857v3 | h-index: 52
Originality: Incremental advance
AI Analysis

The paper addresses the need for more effective image variation techniques in computer vision, though it appears incremental because it builds on existing diffusion model approaches.

The paper tackles the problem of generating diverse image variations while preserving semantic context. It introduces Semantica, a diffusion model trained on web-scale image pairs that, once trained, generates new images from a dataset when given images from that dataset as input.

Generating image variations, where a model produces variations of an input image while preserving the semantic context, has gained increasing attention. Current image variation techniques involve adapting a text-to-image model to reconstruct an input image conditioned on the same image. We first demonstrate that a diffusion model trained to reconstruct an input image from frozen embeddings can reconstruct the image with minor variations. Second, inspired by how text-to-image models learn from web-scale text-image pairs, we explore a new pretraining strategy to generate image variations using a large collection of image pairs. Our diffusion model Semantica receives a random (encoded) image from a webpage as conditional input and denoises another noisy random image from the same webpage. We carefully examine various design choices for the image encoder, given its crucial role in extracting relevant context from the input image. Once trained, Semantica can adaptively generate new images from a dataset by simply using images from that dataset as input. Finally, we identify limitations in standard image consistency metrics for evaluating image variations and propose alternative metrics based on few-shot generation.
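
The pairing scheme described in the abstract can be illustrated with a minimal sketch of one training step: draw two different images from the same webpage, encode one with a frozen encoder as the condition, add noise to the other, and score the denoiser's prediction. Everything below is an assumption made for illustration and is not taken from the paper: the function names, the linear stand-ins for the encoder and the denoiser, the cosine noise schedule, and the noise-prediction loss are all hypothetical.

```python
import numpy as np

def sample_pair(pages, rng):
    """Pick one webpage with at least two images, then two distinct images from it."""
    eligible = [p for p in pages if len(p) >= 2]
    page = eligible[rng.integers(len(eligible))]
    i, j = rng.choice(len(page), size=2, replace=False)
    return page[i], page[j]  # (conditioning image, target image)

def frozen_encoder(image, proj):
    """Hypothetical stand-in for a pretrained, frozen image encoder (fixed linear map)."""
    return image.reshape(-1) @ proj

def add_noise(target, t, rng):
    """Forward diffusion under an assumed cosine schedule: x_t = alpha * x_0 + sigma * eps."""
    alpha = np.cos(0.5 * np.pi * t)
    sigma = np.sin(0.5 * np.pi * t)
    eps = rng.standard_normal(target.shape)
    return alpha * target + sigma * eps, eps

def toy_denoiser(noisy, t, context, weights):
    """Placeholder for the denoising network: predicts noise from (x_t, t, context)."""
    features = np.concatenate([noisy.reshape(-1), [t], context])
    return (features @ weights).reshape(noisy.shape)

def training_loss(pages, proj, weights, rng):
    """One step: condition on one image, denoise another image from the same page."""
    cond, target = sample_pair(pages, rng)
    context = frozen_encoder(cond, proj)      # encoder output is the conditional input
    t = rng.uniform(0.0, 1.0)                 # random noise level
    noisy, eps = add_noise(target, t, rng)
    pred = toy_denoiser(noisy, t, context, weights)
    return float(np.mean((pred - eps) ** 2))  # assumed noise-prediction MSE

# Usage with random toy data: 5 "webpages" of 3 small images each.
rng = np.random.default_rng(0)
pages = [[rng.standard_normal((4, 4, 3)) for _ in range(3)] for _ in range(5)]
dim = 4 * 4 * 3
proj = rng.standard_normal((dim, 8))
weights = rng.standard_normal((dim + 1 + 8, dim)) * 0.01
print(training_loss(pages, proj, weights, rng))
```

The only part of this sketch that mirrors the abstract directly is the data flow: the condition and the denoising target are different images that share a webpage, and the encoder is kept fixed. At sampling time, the same conditioning path would take an image from the dataset of interest.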

