CV · May 29

Learning Global Motion with Compact Gaussians for Feed-Forward 4D Reconstruction

Mungyeom Kim, Minkyeong Jeon, Honggyu An, Jaewoo Jung, Hyuna Ko, Jisang Han, Hyeonseo Yu, Donghwan Shin, Sunghwan Hong, Takuya Narihira, Kazumi Fukuda, Yuki Mitsufuji

arXiv:2605.31595


AI Analysis

This work is relevant to computer vision researchers working on dynamic scene reconstruction and novel-view synthesis: it offers a feed-forward approach that uses fewer Gaussians and is more robust to large temporal gaps.

The paper addresses the challenge of dynamic scene reconstruction from monocular video, which existing feed-forward methods struggle with due to duplicated Gaussians and view-dependent biases. The authors propose C4G, a feed-forward 4D reconstruction framework that uses a compact set of timestamp-conditioned learnable Gaussian query tokens to model globally coherent motion and achieve strong novel-view synthesis performance with significantly fewer Gaussians.

Dynamic scene reconstruction from monocular video remains a fundamental challenge in computer vision. Existing feed-forward methods predict 3D Gaussians pixel-wise for each frame, suffering from duplicated Gaussians and view-dependent biases that hinder effective learning of scene motion. We present C4G, a feed-forward 4D reconstruction framework built upon a compact set of timestamp-conditioned learnable Gaussian query tokens. Each token aggregates corresponding features across the full temporal context and decodes a 3D Gaussian whose position is modulated by the target timestamp, enabling globally coherent motion modeling without per-scene optimization. To capture fine-grained details, we further introduce a video diffusion model-based rendering enhancement module. Since our framework effectively aggregates features into Gaussians, we extend this capability to feature lifting, producing a 4D feature field that supports point tracking and dynamic scene understanding. C4G achieves strong novel-view synthesis performance using significantly fewer Gaussians and without requiring camera poses, while exhibiting stronger motion modeling and robustness to large temporal gaps.
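The token-to-Gaussian step described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the single cross-attention layer, the tanh MLP, and every name here (`decode_gaussian_centers`, `W_pos`, `W_h`, `W_off`, the toy sizes) are assumptions chosen only to show the idea of a fixed set of query tokens that pool features from all frames and then decode a position shifted by the target timestamp.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def decode_gaussian_centers(queries, feats, t, params):
    """Hypothetical decoder: queries (N, D), feats (T*P, D), scalar timestamp t."""
    d = queries.shape[1]
    # Cross-attention: each query token aggregates features over all frames.
    attn = softmax(queries @ feats.T / np.sqrt(d), axis=-1)      # (N, T*P)
    tokens = queries + attn @ feats                               # (N, D)
    # Time-independent base position, one per token.
    base = tokens @ params["W_pos"]                               # (N, 3)
    # Timestamp-conditioned offset: append t to each token, then a small MLP.
    t_col = np.full((tokens.shape[0], 1), t)
    hidden = np.tanh(np.concatenate([tokens, t_col], axis=1) @ params["W_h"])
    offset = hidden @ params["W_off"]                             # (N, 3)
    return base + offset

# Toy sizes (assumed): N tokens, D feature dims, H hidden units, T frames, P patches per frame.
rng = np.random.default_rng(0)
N, D, H, T, P = 8, 16, 32, 4, 10
params = {
    "W_pos": rng.normal(size=(D, 3)) * 0.1,
    "W_h": rng.normal(size=(D + 1, H)) * 0.1,
    "W_off": rng.normal(size=(H, 3)) * 0.1,
}
queries = rng.normal(size=(N, D))        # stand-in for learned query tokens
feats = rng.normal(size=(T * P, D))      # stand-in for per-frame image features

centers_t0 = decode_gaussian_centers(queries, feats, 0.0, params)
centers_t1 = decode_gaussian_centers(queries, feats, 1.0, params)
```

In this sketch the number of Gaussian centers equals the number of tokens `N`, independent of how many frames or pixels are fed in, which is the contrast with pixel-wise prediction that the abstract draws. A full Gaussian would also need scale, rotation, opacity, and color, which are omitted here.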
