Evict3R: Training-Free Token Eviction for Memory-Bounded Streaming Visual Geometry Transformers
The work addresses memory scalability in streaming 3D perception and makes long-horizon inference more practical, though the contribution is incremental because it builds on the existing StreamVGGT method.
The paper tackles unbounded memory growth in streaming visual transformers by proposing a training-free token eviction policy that bounds memory usage while maintaining accuracy. On 7-Scenes, it reduces peak memory from 18.63 GB to 9.39 GB with only a 0.003 drop in accuracy and completeness. Under strict memory budgets, eviction also enables denser frame sampling.
Streaming visual transformers like StreamVGGT achieve strong 3D perception but suffer from unbounded growth of key-value (KV) memory, which limits scalability. We propose a training-free, inference-time token eviction policy that bounds memory by discarding redundant tokens while keeping the most informative ones. Our method uses significantly less memory with little to no drop in accuracy: on 7-Scenes with long sequences, it reduces peak memory from 18.63 GB to 9.39 GB while accuracy and completeness drop by only 0.003. Under strict memory budgets, eviction enables denser frame sampling, which improves reconstruction accuracy compared to the baseline. Experiments across video depth estimation (Sintel, KITTI), 3D reconstruction (7-Scenes, NRGBD), and camera pose estimation (Sintel, TUM-dynamics) show that our approach closely matches StreamVGGT at a fraction of the memory and makes long-horizon streaming inference more practical.
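The general idea of bounding a streaming KV cache by eviction can be illustrated with a minimal sketch. The abstract does not state how tokens are ranked, so the per-token importance scores below are placeholder inputs, and the class name `BoundedKVCache`, its `budget` parameter, and its methods are hypothetical names chosen for illustration rather than the paper's implementation. The sketch shows only the bookkeeping: each new frame appends its tokens, and the lowest-scoring tokens are dropped whenever the cache exceeds a fixed budget.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class BoundedKVCache:
    """Toy cache that never holds more than `budget` tokens (illustrative only)."""
    budget: int
    # Each entry is (score, frame_idx, token_idx); a real cache would also
    # store the key and value tensors for that token.
    entries: list = field(default_factory=list)

    def append_frame(self, frame_idx, scores):
        """Add all tokens of a new frame, then evict down to the budget.

        `scores` are assumed importance values, one per token; how they are
        computed is not specified here.
        """
        for token_idx, score in enumerate(scores):
            self.entries.append((score, frame_idx, token_idx))
        self._evict()

    def _evict(self):
        if len(self.entries) <= self.budget:
            return
        # Keep the `budget` highest-scoring tokens, discard the rest.
        kept = heapq.nlargest(self.budget, self.entries)
        # Restore temporal order (frame, then token position).
        kept.sort(key=lambda e: (e[1], e[2]))
        self.entries = kept

    def kept_tokens(self):
        """Return (frame_idx, token_idx) pairs currently in the cache."""
        return [(f, t) for _, f, t in self.entries]


cache = BoundedKVCache(budget=4)
cache.append_frame(0, [0.9, 0.1, 0.5])
cache.append_frame(1, [0.8, 0.2, 0.7])
print(cache.kept_tokens())
```

Because the cache size is capped at `budget` regardless of how many frames arrive, memory stays bounded over arbitrarily long streams, which is the property the abstract relies on to fit denser frame sampling under a fixed memory limit.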