CV · Dec 4, 2025

LiteVGGT: Boosting Vanilla VGGT via Geometry-aware Cached Token Merging

arXiv:2512.04939v1 · 6 citations · h-index: 9
Originality: Incremental advance
AI Analysis

The paper addresses the time and memory cost of large-scale 3D reconstruction. The advance is incremental because it builds on the existing VGGT model.

The paper tackles the computational inefficiency of 3D vision foundation models like VGGT for large-scale scenes by proposing LiteVGGT, which achieves up to 10x speedup and substantial memory reduction, enabling efficient processing of 1000-image scenes while retaining core performance.

3D vision foundation models like Visual Geometry Grounded Transformer (VGGT) have advanced greatly in geometric perception. However, it is time-consuming and memory-intensive for long sequences, limiting application to large-scale scenes beyond hundreds of images. To address this, we propose LiteVGGT, achieving up to 10x speedup and substantial memory reduction, enabling efficient processing of 1000-image scenes. We derive two key insights for 3D reconstruction: (1) tokens from local image regions have inherent geometric correlations, leading to high similarity and computational redundancy; (2) token similarity across adjacent network layers remains stable, allowing for reusable merge decisions. Guided by these, we design a simple yet efficient strategy, dubbed geometry-aware cached token merging. We analyze each token's geometric importance, optimizing anchor token selection to better preserve key information for reconstruction. We also cache and reuse merge indices across layers, substantially reducing latency with minimal accuracy impact. This strategy retains VGGT's core performance, enabling efficient fine-tuning and FP8 quantization for further gains. Extensive experiments validate LiteVGGT's effectiveness, scalability, and robustness. Project page: https://garlicba.github.io/LiteVGGT/
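The merging strategy described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The `importance` scores stand in for the paper's geometric importance measure, which the abstract does not define. The function names, the cosine-similarity assignment, the averaging merge, and the `refresh_every` parameter are all assumptions made for illustration.

```python
import numpy as np

def select_anchors(importance, num_anchors):
    """Pick the tokens with the highest (hypothetical) importance scores."""
    order = np.argsort(-importance, kind="stable")
    return np.sort(order[:num_anchors])

def compute_merge_index(tokens, anchor_idx):
    """Assign every token to its most similar anchor (cosine similarity)."""
    norms = np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8
    normed = tokens / norms
    sim = normed @ normed[anchor_idx].T           # shape (N, K)
    assign = sim.argmax(axis=1)
    assign[anchor_idx] = np.arange(len(anchor_idx))  # anchors keep their own group
    return assign

def merge(tokens, assign, num_anchors):
    """Average the tokens of each group into one token, giving shape (K, D)."""
    merged = np.zeros((num_anchors, tokens.shape[1]), dtype=tokens.dtype)
    np.add.at(merged, assign, tokens)
    counts = np.bincount(assign, minlength=num_anchors)[:, None]
    return merged / counts

def unmerge(merged, assign):
    """Copy each merged token back to every position in its group."""
    return merged[assign]

def run_layers(tokens, importance, layers, num_anchors, refresh_every=4):
    """Apply layers on merged tokens, reusing cached merge indices.

    refresh_every is an assumed knob: the merge index is recomputed only
    every few layers and reused in between.
    """
    assign = None
    for i, layer in enumerate(layers):
        if assign is None or i % refresh_every == 0:
            anchor_idx = select_anchors(importance, num_anchors)
            assign = compute_merge_index(tokens, anchor_idx)  # cache
        reduced = merge(tokens, assign, num_anchors)
        tokens = unmerge(layer(reduced), assign)
    return tokens
```

Each layer here processes `num_anchors` tokens instead of all N. The similarity computation, which is the expensive part of deciding what to merge, runs only when the cache is refreshed, which is the role the abstract gives to reusing merge indices across layers.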

