95.9CVMay 14
VGGT-$Ω$Jianyuan Wang, Minghao Chen, Shangzhan Zhang et al.
Recent feed-forward reconstruction models, such as VGGT, have proven competitive with traditional optimization-based reconstructors while also providing geometry-aware features useful for other tasks. Here, we show that the quality of these models scales predictably with model and data size. We do so by introducing VGGT-$Ω$, which substantially improves reconstruction accuracy, efficiency, and capabilities for both static and dynamic scenes. To enable training this model at an unprecedented scale, we introduce architectural changes that improve training efficiency, a high-quality data annotation pipeline that supports dynamic scenes, and a self-supervised learning protocol. We simplify VGGT's architecture by using a single dense prediction head with multi-task supervision and removing the expensive high-resolution convolutional layers. We also use registers to aggregate scene information into a compact representation and introduce register attention, which restricts inter-frame information exchange to these registers, in part replacing global attention. In this way, during training, VGGT-$Ω$ uses only about 30% of the GPU memory of its predecessor, allowing us to train with 15x more supervised data than prior work and to leverage vast amounts of unlabeled video data. VGGT-$Ω$ achieves strong results for reconstruction of static and dynamic scenes across multiple benchmarks, for example, improving over the previous best camera estimation accuracy on Sintel by 77%. We also show that the learned registers can improve vision-language-action models and support alignment with language, suggesting that reconstruction can be a powerful and scalable proxy task for spatial understanding. Project Page: http://vggt-omega.github.io/
CVSep 16, 2025
MapAnything: Universal Feed-Forward Metric 3D ReconstructionNikhil Keetha, Norman Müller, Johannes Schönberger et al.
We introduce MapAnything, a unified transformer-based feed-forward model that ingests one or more images along with optional geometric inputs such as camera intrinsics, poses, depth, or partial reconstructions, and then directly regresses the metric 3D scene geometry and cameras. MapAnything leverages a factored representation of multi-view scene geometry, i.e., a collection of depth maps, local ray maps, camera poses, and a metric scale factor that effectively upgrades local reconstructions into a globally consistent metric frame. Standardizing the supervision and training across diverse datasets, along with flexible input augmentation, enables MapAnything to address a broad range of 3D vision tasks in a single feed-forward pass, including uncalibrated structure-from-motion, calibrated multi-view stereo, monocular depth estimation, camera localization, depth completion, and more. We provide extensive experimental analyses and model ablations demonstrating that MapAnything outperforms or matches specialist feed-forward models while offering more efficient joint training behavior, thus paving the way toward a universal 3D reconstruction backbone.
GRJun 23, 2025
BulletGen: Improving 4D Reconstruction with Bullet-Time GenerationDenys Rozumnyi, Jonathon Luiten, Numair Khan et al.
Transforming casually captured, monocular videos into fully immersive dynamic experiences is a highly ill-posed task, and comes with significant challenges, e.g., reconstructing unseen regions, and dealing with the ambiguity in monocular depth estimation. In this work we introduce BulletGen, an approach that takes advantage of generative models to correct errors and complete missing information in a Gaussian-based dynamic scene representation. This is done by aligning the output of a diffusion-based video generation model with the 4D reconstruction at a single frozen "bullet-time" step. The generated frames are then used to supervise the optimization of the 4D Gaussian model. Our method seamlessly blends generative content with both static and dynamic scene components, achieving state-of-the-art results on both novel-view synthesis, and 2D/3D tracking tasks.
CVMar 10, 2025
Fixing the RANSAC Stopping CriterionJohannes Schönberger, Viktor Larsson, Marc Pollefeys
For several decades, RANSAC has been one of the most commonly used robust estimation algorithms for many problems in computer vision and related fields. The main contribution of this paper lies in addressing a long-standing error baked into virtually any system building upon the RANSAC algorithm. Since its inception in 1981 by Fischler and Bolles, many variants of RANSAC have been proposed on top of the same original idea relying on the fact that random sampling has a high likelihood of generating a good hypothesis from minimal subsets of measurements. An approximation to the sampling probability was originally derived by the paper in 1981 in support of adaptively stopping RANSAC and is, as such, used in the vast majority of today's RANSAC variants and implementations. The impact of this approximation has since not been questioned or thoroughly studied by any of the later works. As we theoretically derive and practically demonstrate in this paper, the approximation leads to severe undersampling and thus failure to find good models. The discrepancy is especially pronounced in challenging scenarios with few inliers and high model complexity. An implementation of computing the exact probability is surprisingly simple yet highly effective and has potentially drastic impact across a large range of computer vision systems.
ROSep 21, 2021
Panoptic Multi-TSDFs: a Flexible Representation for Online Multi-resolution Volumetric Mapping and Long-term Dynamic Scene ConsistencyLukas Schmid, Jeffrey Delmerico, Johannes Schönberger et al.
For robotic interaction in environments shared with other agents, access to volumetric and semantic maps of the scene is crucial. However, such environments are inevitably subject to long-term changes, which the map needs to account for. We thus propose panoptic multi-TSDFs as a novel representation for multi-resolution volumetric mapping in changing environments. By leveraging high-level information for 3D reconstruction, our proposed system allocates high resolution only where needed. Through reasoning on the object level, semantic consistency over time is achieved. This enables our method to maintain up-to-date reconstructions with high accuracy while improving coverage by incorporating previous data. We show in thorough experimental evaluation that our map can be efficiently constructed, maintained, and queried during online operation, and that the presented approach can operate robustly on real depth sensors using non-optimized panoptic segmentation as input.