CV | AI | Nov 12, 2024

Scaling Properties of Diffusion Models for Perceptual Tasks

arXiv:2411.08034v3 | 18 citations | h-index: 17 | Has Code | CVPR
AI Analysis

This work addresses the efficiency and scalability of visual perception in AI and computer vision applications. Its contribution is incremental: it applies existing diffusion model paradigms to new tasks.

The paper casts visual perception tasks such as depth estimation and optical flow as image-to-image translation with diffusion models. It shows that scaling training and test-time compute yields performance competitive with state-of-the-art methods while using significantly less data and compute.

In this paper, we argue that iterative computation with diffusion models offers a powerful paradigm for not only generation but also visual perception tasks. We unify tasks such as depth estimation, optical flow, and amodal segmentation under the framework of image-to-image translation, and show how diffusion models benefit from scaling training and test-time compute for these perceptual tasks. Through a careful analysis of these scaling properties, we formulate compute-optimal training and inference recipes to scale diffusion models for visual perception tasks. Our models achieve competitive performance to state-of-the-art methods using significantly less data and compute. To access our code and models, see https://scaling-diffusion-perception.github.io .
