CV · May 3, 2025

Learning Multi-frame and Monocular Prior for Estimating Geometry in Dynamic Scenes

arXiv:2505.01737v3 · 1 citation · h-index: 2
Originality: Incremental advance
AI Analysis

This paper addresses the problem of noisy and partial geometry predictions in dynamic scenes for computer vision applications, and it represents an incremental advance over existing methods.

The paper tackles the challenge of estimating 3D geometry in dynamic scenes from monocular videos by introducing a new model called MMP, which produces dynamic pointmaps in a feed-forward manner and achieves a 15.1% improvement in regression error.

Estimating the 3D geometry of monocular videos that capture dynamic scenes has been a fundamental challenge in computer vision. Specifically, the task is significantly complicated by object motion, and existing models are limited to predicting only partial attributes of dynamic scenes, such as depth or pointmaps spanning only a pair of frames. Since these attributes are inherently noisy across multiple frames, test-time global optimization is often employed to fully recover the geometry, which is liable to fail and incurs heavy inference costs. To address this challenge, we present a new model, coined MMP, that estimates the geometry in a feed-forward manner and produces a dynamic pointmap representation that evolves over multiple frames. Specifically, building on the recent Siamese architecture, we introduce a new trajectory encoding module that projects point-wise dynamics onto the representation for each frame, which can provide significantly improved expressiveness for dynamic scenes. In our experiments, we find that MMP achieves state-of-the-art quality in feed-forward pointmap prediction, e.g., a 15.1% improvement in regression error.
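
For readers unfamiliar with the terms, the sketch below illustrates what a pointmap is (one 3D point per pixel) and how a relative reduction in a regression error can be computed. It is not the MMP model or its evaluation protocol: the pinhole camera, the helper names (`unproject_depth`, `mean_point_error`, `relative_reduction`), the mean Euclidean error, and all numeric values are assumptions made up for illustration.

```python
# Illustrative sketch only; not the MMP model or its evaluation code.
from math import dist


def unproject_depth(depth, fx, fy, cx, cy):
    """Turn an H x W depth map into an H x W pointmap of (X, Y, Z) points.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy); points are expressed in that camera's coordinates.
    """
    return [
        [((u - cx) * z / fx, (v - cy) * z / fy, z) for u, z in enumerate(row)]
        for v, row in enumerate(depth)
    ]


def mean_point_error(pred, target):
    """Mean Euclidean distance between corresponding points of two pointmaps."""
    pairs = [(p, t) for prow, trow in zip(pred, target) for p, t in zip(prow, trow)]
    return sum(dist(p, t) for p, t in pairs) / len(pairs)


def relative_reduction(baseline_error, new_error):
    """Fractional error reduction relative to a baseline (0.151 means 15.1%)."""
    return (baseline_error - new_error) / baseline_error


# Hypothetical 2 x 2 depth maps for two video frames (made-up values).
frames = [
    [[1.0, 1.0], [2.0, 2.0]],
    [[1.0, 1.5], [2.0, 2.5]],
]

# In this toy setting, a multi-frame pointmap is just one pointmap per frame,
# each in its own camera coordinates. The paper's representation is richer.
pointmaps = [unproject_depth(d, fx=1.0, fy=1.0, cx=0.5, cy=0.5) for d in frames]
```

Calling `mean_point_error(pointmaps[0], pointmaps[1])` measures how far the per-pixel points moved between the two toy frames, and `relative_reduction(baseline, new)` expresses an error drop as a fraction of the baseline, which is the form in which figures such as 15.1% are usually reported.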

