CV, May 8, 2025

DiffusionSfM: Predicting Structure and Motion via Ray Origin and Endpoint Diffusion

arXiv:2505.05473v1, 12 citations, h-index: 45, CVPR
AI Analysis

This work addresses Structure-from-Motion (SfM), proposing an end-to-end approach that the authors report outperforms existing methods.

The paper tackles 3D scene reconstruction and camera pose estimation from multi-view images. It proposes DiffusionSfM, a data-driven method that uses a transformer-based diffusion model to directly predict pixel-wise ray origins and endpoints. The authors report that it outperforms classical and learning-based approaches on both synthetic and real datasets.

Current Structure-from-Motion (SfM) methods typically follow a two-stage pipeline, combining learned or geometric pairwise reasoning with a subsequent global optimization step. In contrast, we propose a data-driven multi-view reasoning approach that directly infers 3D scene geometry and camera poses from multi-view images. Our framework, DiffusionSfM, parameterizes scene geometry and cameras as pixel-wise ray origins and endpoints in a global frame and employs a transformer-based denoising diffusion model to predict them from multi-view inputs. To address practical challenges in training diffusion models with missing data and unbounded scene coordinates, we introduce specialized mechanisms that ensure robust learning. We empirically validate DiffusionSfM on both synthetic and real datasets, demonstrating that it outperforms classical and learning-based approaches while naturally modeling uncertainty.
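As a rough illustration of the ray parameterization described in the abstract, the sketch below computes a ray origin and endpoint for a single pixel from a known pinhole camera and a depth value. It is not the authors' code. The function name `pixel_ray`, the helper `transpose_mul`, the world-to-camera convention `x_cam = R @ x_world + t`, and the reading of "endpoint" as the back-projected surface point at the given z-depth are all assumptions made for illustration.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Sequence[Sequence[float]]


def transpose_mul(R: Mat3, v: Sequence[float]) -> Vec3:
    """Return R^T v for a 3x3 matrix R (hypothetical helper)."""
    return tuple(sum(R[k][i] * v[k] for k in range(3)) for i in range(3))


def pixel_ray(
    u: float, v: float, depth: float,
    fx: float, fy: float, cx: float, cy: float,
    R: Mat3, t: Sequence[float],
) -> Tuple[Vec3, Vec3]:
    """Ray origin and endpoint in the global frame for pixel (u, v).

    Assumed convention: x_cam = R x_world + t, with `depth` measured
    along the camera z-axis. This is an illustrative sketch only.
    """
    # The ray origin is the camera center in world coordinates: c = -R^T t.
    origin = transpose_mul(R, [-t[0], -t[1], -t[2]])
    # Back-project the pixel into the camera frame at the given z-depth.
    p_cam = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Move the camera-frame point into the world frame: R^T (p_cam - t).
    endpoint = transpose_mul(R, [p_cam[i] - t[i] for i in range(3)])
    return origin, endpoint


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    # Principal-point pixel, depth 3, camera centered at z = 2 in the world.
    print(pixel_ray(50.0, 50.0, 3.0, 100.0, 100.0, 50.0, 50.0,
                    identity, [0.0, 0.0, -2.0]))
```

Under these assumptions, every pixel of one image shares the same origin (the camera center), while the endpoints vary per pixel and trace the visible surface.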

