CV · Mar 10, 2024

What Matters When Repurposing Diffusion Models for General Dense Perception Tasks?

CMU
arXiv:2403.06090v4 · 76 citations · h-index: 16 · ICLR
Originality: Incremental advance
AI Analysis

This work addresses the problem of efficiently adapting pre-trained diffusion models for general dense perception tasks, offering a faster and more effective fine-tuning method for researchers and practitioners in computer vision.

The paper investigates key factors for repurposing text-to-image diffusion models for dense perception tasks like depth estimation and segmentation, finding that high-quality fine-tuning data and image-level supervision are crucial, and proposes GenPercept, a deterministic one-step paradigm that achieves faster inference and improved fine-grained details across multiple tasks.

Extensive pre-training with large data is indispensable for downstream geometry and semantic visual perception tasks. Thanks to large-scale text-to-image (T2I) pretraining, recent works show promising results by simply fine-tuning T2I diffusion models for dense perception tasks. However, several crucial design decisions in this process still lack comprehensive justification, encompassing the necessity of the multi-step stochastic diffusion mechanism, training strategy, inference ensemble strategy, and fine-tuning data quality. In this work, we conduct a thorough investigation into critical factors that affect transfer efficiency and performance when using diffusion priors. Our key findings are: 1) High-quality fine-tuning data is paramount for both semantic and geometry perception tasks. 2) The stochastic nature of diffusion models has a slightly negative impact on deterministic visual perception tasks. 3) Apart from fine-tuning the diffusion model with only latent space supervision, task-specific image-level supervision is beneficial to enhance fine-grained details. These observations culminate in the development of GenPercept, an effective deterministic one-step fine-tuning paradigm tailored for dense visual perception tasks. Unlike previous multi-step methods, our paradigm has a much faster inference speed, and can be seamlessly integrated with customized perception decoders and loss functions for image-level supervision, which is critical to improving the fine-grained details of predictions. Comprehensive experiments on diverse dense visual perceptual tasks, including monocular depth estimation, surface normal estimation, image segmentation, and matting, are performed to demonstrate the remarkable adaptability and effectiveness of our proposed method.
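To make the inference-cost contrast in the abstract concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: `ToyDenoiser`, its update rule, and both function names are hypothetical stand-ins, and the assumption that the one-step path feeds the image latent with a fixed timestep and no sampled noise is an illustrative reading of "deterministic one-step". The sketch only shows that a multi-step stochastic ensemble costs `steps * ensemble` network calls and depends on the random seed, while a one-step deterministic pass costs a single call and is repeatable.

```python
import random


class ToyDenoiser:
    """Hypothetical stand-in for a fine-tuned network; counts forward passes."""

    def __init__(self):
        self.calls = 0

    def __call__(self, image_latent, state, t):
        self.calls += 1
        # Toy rule (an assumption, not a real model): the output is driven by
        # the image latent, plus a leak of the current state scaled by t.
        return [2.0 * x + 0.5 * t * s for x, s in zip(image_latent, state)]


def multi_step_stochastic(net, image_latent, steps=10, ensemble=5, seed=0):
    """Start from Gaussian noise, refine for `steps` passes, average an ensemble."""
    rng = random.Random(seed)
    runs = []
    for _ in range(ensemble):
        state = [rng.gauss(0.0, 1.0) for _ in image_latent]
        for k in range(steps, 0, -1):
            state = net(image_latent, state, t=k / steps)
        runs.append(state)
    # Element-wise mean over the ensemble members.
    return [sum(vals) / ensemble for vals in zip(*runs)]


def one_step_deterministic(net, image_latent):
    """Single forward pass with no sampled noise and a fixed timestep."""
    zeros = [0.0] * len(image_latent)
    return net(image_latent, zeros, t=1.0)


if __name__ == "__main__":
    latent = [0.5, -1.0]

    slow_net = ToyDenoiser()
    multi_step_stochastic(slow_net, latent)
    print("multi-step calls:", slow_net.calls)

    fast_net = ToyDenoiser()
    one_step_deterministic(fast_net, latent)
    print("one-step calls:", fast_net.calls)
```

In this toy setting the call counts differ by a factor of `steps * ensemble`, which is the source of the speed-up the abstract describes. The image-level supervision mentioned in the abstract would be an additional training-time loss on decoded predictions and is not modelled here.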

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
