CV · AI · Nov 12, 2024

Scaling Properties of Diffusion Models for Perceptual Tasks

Rahul Ravishankar, Zeeshan Patel, Jathushan Rajasegaran, Jitendra Malik

arXiv:2411.08034v3 · 15.818 citations · h-index: 143 · Has Code · CVPR

Originality: Incremental advance

AI Analysis

This work addresses efficiency and scalability challenges in visual perception for computer vision applications. It is an incremental advance, since it applies existing diffusion model paradigms to new tasks.

The paper tackles visual perception tasks such as depth estimation and optical flow by unifying them under an image-to-image translation framework built on diffusion models. It shows that scaling training and test-time compute yields performance competitive with state-of-the-art methods while using significantly less data and compute.

In this paper, we argue that iterative computation with diffusion models offers a powerful paradigm for not only generation but also visual perception tasks. We unify tasks such as depth estimation, optical flow, and amodal segmentation under the framework of image-to-image translation, and show how diffusion models benefit from scaling training and test-time compute for these perceptual tasks. Through a careful analysis of these scaling properties, we formulate compute-optimal training and inference recipes to scale diffusion models for visual perception tasks. Our models achieve performance competitive with state-of-the-art methods using significantly less data and compute. To access our code and models, see https://scaling-diffusion-perception.github.io
