CV, Feb 26

VGG-T$^3$: Offline Feed-Forward 3D Reconstruction at Scale

arXiv:2602.23361v1, 7 citations, h-index: 24
Originality Highly original
AI Analysis

This work provides a significant speed-up for researchers and practitioners working on large-scale offline 3D reconstruction, making it more feasible to process extensive image collections.

The paper introduces VGG-T$^3$, a scalable 3D reconstruction model that overcomes the quadratic scaling of offline feed-forward methods by distilling the varying-length Key-Value space into a fixed-size MLP. As a result, cost scales linearly with the number of input views: the model reconstructs a 1k-image collection in 54 seconds, an 11.6x speed-up over softmax attention baselines.
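To make the scaling claim concrete, here is a rough cost model rather than a measurement. The function names, the tokens-per-view figure, and the per-token state cost are hypothetical and do not come from the paper. The model only illustrates why doubling the number of views quadruples pairwise attention work but merely doubles the work of a fixed-size state. The last line derives the baseline runtime implied by the reported figures, assuming the 11.6x speed-up refers to the same 1k-image collection.

```python
def attention_pair_count(num_views: int, tokens_per_view: int) -> int:
    """Token pairs scored by global softmax attention: quadratic in total tokens."""
    total_tokens = num_views * tokens_per_view
    return total_tokens * total_tokens


def fixed_state_op_count(num_views: int, tokens_per_view: int, state_cost: int) -> int:
    """Work when every token touches a fixed-size state once: linear in total tokens.

    `state_cost` is a hypothetical per-token cost of the fixed-size MLP.
    """
    return num_views * tokens_per_view * state_cost


TOKENS_PER_VIEW = 1000  # assumed for illustration only
STATE_COST = 4096       # assumed for illustration only

# Growth factor when the collection doubles from 500 to 1000 views.
quadratic_growth = (attention_pair_count(1000, TOKENS_PER_VIEW)
                    / attention_pair_count(500, TOKENS_PER_VIEW))
linear_growth = (fixed_state_op_count(1000, TOKENS_PER_VIEW, STATE_COST)
                 / fixed_state_op_count(500, TOKENS_PER_VIEW, STATE_COST))

# Baseline runtime implied by 54 s and an 11.6x speed-up (roughly 626 s).
implied_baseline_seconds = 54 * 11.6

print(quadratic_growth, linear_growth, round(implied_baseline_seconds, 1))
```

The absolute counts depend entirely on the assumed constants. Only the growth factors and the implied baseline of roughly ten minutes carry over.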

We present a scalable 3D reconstruction model that addresses a critical limitation in offline feed-forward methods: their computational and memory requirements grow quadratically w.r.t. the number of input images. Our approach is built on the key insight that this bottleneck stems from the varying-length Key-Value (KV) space representation of scene geometry, which we distill into a fixed-size Multi-Layer Perceptron (MLP) via test-time training. VGG-T$^3$ (Visual Geometry Grounded Test Time Training) scales linearly w.r.t. the number of input views, similar to online models, and reconstructs a $1k$-image collection in just $54$ seconds, achieving an $11.6\times$ speed-up over baselines that rely on softmax attention. Since our method retains global scene aggregation capability, it outperforms other linear-time methods in point map reconstruction error by large margins. Finally, we demonstrate the visual localization capabilities of our model by querying the scene representation with unseen images.
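The idea of distilling a growing set of key-value pairs into a fixed-size MLP at test time can be illustrated with a toy sketch. This is not the authors' implementation: the two-layer tanh MLP, the squared-error objective, the plain gradient-descent update, the synthetic keys and values, and all names (`init_mlp`, `ttt_step`, `distill`) are assumptions chosen for illustration. It shows only that the state holding the "scene" keeps a constant size while views are streamed through it, and that new queries are answered by a forward pass whose cost does not depend on the number of views.

```python
import numpy as np


def init_mlp(d_in, d_hid, d_out, rng):
    """Fixed-size two-layer MLP; its size is independent of the number of views."""
    return {
        "W1": rng.normal(0.0, 1.0 / np.sqrt(d_in), (d_in, d_hid)),
        "b1": np.zeros(d_hid),
        "W2": rng.normal(0.0, 1.0 / np.sqrt(d_hid), (d_hid, d_out)),
        "b2": np.zeros(d_out),
    }


def forward(params, x):
    hidden = np.tanh(x @ params["W1"] + params["b1"])
    return hidden @ params["W2"] + params["b2"], hidden


def regression_loss(params, keys, values):
    pred, _ = forward(params, keys)
    return float(0.5 * ((pred - values) ** 2).sum(axis=1).mean())


def ttt_step(params, keys, values, lr):
    """One gradient step that pushes the MLP to map this view's keys to its values."""
    pred, hidden = forward(params, keys)
    err = (pred - values) / keys.shape[0]            # dLoss/dPred
    grad_hidden = (err @ params["W2"].T) * (1.0 - hidden ** 2)
    grads = {
        "W1": keys.T @ grad_hidden, "b1": grad_hidden.sum(axis=0),
        "W2": hidden.T @ err, "b2": err.sum(axis=0),
    }
    for name in params:
        params[name] -= lr * grads[name]


def distill(views, params, lr=0.02, epochs=20):
    """Stream views through the MLP; each view costs the same amount of work."""
    for _ in range(epochs):
        for keys, values in views:
            ttt_step(params, keys, values, lr)
    return params


rng = np.random.default_rng(0)
d_key, d_val = 8, 4
target = rng.normal(size=(d_key, d_val)) / np.sqrt(d_key)

# Synthetic stand-in for per-view key/value tokens (10 views, 64 tokens each).
views = []
for _ in range(10):
    keys = rng.normal(size=(64, d_key))
    views.append((keys, np.tanh(keys @ target)))

all_keys = np.concatenate([k for k, _ in views])
all_values = np.concatenate([v for _, v in views])

params = init_mlp(d_key, 32, d_val, rng)
loss_before = regression_loss(params, all_keys, all_values)
distill(views, params)
loss_after = regression_loss(params, all_keys, all_values)

# Querying with unseen tokens is a single forward pass through the fixed-size MLP.
query = rng.normal(size=(5, d_key))
readout, _ = forward(params, query)
num_params = sum(p.size for p in params.values())
```

In this sketch `num_params` stays the same whether 10 or 10,000 views are streamed through `distill`, which is the property that replaces the varying-length KV space. The actual architecture, objective, and optimizer used by VGG-T$^3$ are specified in the paper, not here.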

