ZipMap: Linear-Time Stateful 3D Reconstruction with Test-Time Training
By processing images in linear rather than quadratic time, this work makes feed-forward 3D reconstruction practical for applications that involve large image collections, such as real-time scene understanding and robotics.
This paper introduces ZipMap, a stateful feed-forward model for 3D reconstruction that processes image collections in linear time, avoiding the quadratic scaling of prior methods. It reconstructs over 700 frames in under 10 seconds on a single H100 GPU, more than 20x faster than state-of-the-art methods such as VGGT, while matching or exceeding their accuracy.
Feed-forward transformer models have driven rapid progress in 3D vision, but state-of-the-art methods such as VGGT and $\pi^3$ have a computational cost that scales quadratically with the number of input images, making them inefficient when applied to large image collections. Sequential-reconstruction approaches reduce this cost but sacrifice reconstruction quality. We introduce ZipMap, a stateful feed-forward model that achieves linear-time, bidirectional 3D reconstruction while matching or surpassing the accuracy of quadratic-time methods. ZipMap employs test-time training layers to zip an entire image collection into a compact hidden scene state in a single forward pass, enabling reconstruction of over 700 frames in under 10 seconds on a single H100 GPU, more than $20\times$ faster than state-of-the-art methods such as VGGT. Moreover, we demonstrate the benefits of having a stateful representation in real-time scene-state querying and its extension to sequential streaming reconstruction.
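To make the idea of a test-time training layer concrete, the following is a minimal NumPy sketch of a generic linear fast-weight layer. It is an illustration only, not ZipMap's architecture: the function name `ttt_linear_layer`, the squared-error write objective, the learning rate, and the chunk size are all assumptions chosen for simplicity. The sketch shows the two properties the abstract relies on: the state has a fixed size regardless of the number of tokens, and the cost grows linearly with that number because each token is written once and read once.

```python
import numpy as np


def ttt_linear_layer(keys, values, queries, lr=0.1, chunk=64):
    """Hypothetical fast-weight layer (illustrative, not the paper's design).

    keys, values: (N, d) arrays written into the state.
    queries:      (M, d) array read from the final state.
    Returns (outputs, W) where W is the (d, d) hidden state.
    """
    n, d = keys.shape
    W = np.zeros((d, d))  # compact state; its size does not depend on N

    # Write pass: one gradient step per chunk on 0.5 * ||k W - v||^2.
    # Total cost is O(N * d^2), i.e. linear in the number of tokens.
    for start in range(0, n, chunk):
        k = keys[start:start + chunk]
        v = values[start:start + chunk]
        grad = k.T @ (k @ W - v) / len(k)
        W -= lr * grad

    # Read pass: every query sees the state built from *all* tokens,
    # so information flows in both directions along the sequence.
    return queries @ W, W


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 8
    W_true = rng.normal(size=(d, d))
    keys = rng.normal(size=(256, d))
    values = keys @ W_true
    out, W = ttt_linear_layer(keys, values, keys)
    print(out.shape, W.shape)
```

Because the state `W` is kept after the write pass, new queries can be answered with a single matrix product without revisiting the inputs, which is the kind of behavior a stateful representation makes possible for scene-state querying.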