LongSpace: Exploring Long-Horizon Spatial Memory from Perception to Recall in Video

Shiqiang Lang, Jing Liu, Haoyang He, Peiwen Sun, Yuanteng Chen, Tao Liu, Lan Yang, Longteng Guo, Honggang Zhang

arXiv:2606.05677

AI Analysis

For researchers working on multimodal LLMs for long-video understanding, this work addresses the underexplored problem of spatial memory in long-horizon tasks like autonomous driving and navigation.

The paper introduces LongSpace-Bench, a benchmark for long-horizon spatial memory in videos, and proposes LongSpace, a memory framework that improves long-video spatial understanding by incorporating 3D structural cues and layer-aware memory. Experiments show performance gains on spatial reasoning benchmarks.

Multimodal Large Language Models (MLLMs) have advanced image and video understanding and can increasingly handle longer visual inputs. Long-horizon tasks such as autonomous driving and robotic navigation require more than recognizing the current view: models must remember and retrieve previously observed spatial layouts, routes, viewpoint changes, and object states. To evaluate this capability, we introduce LongSpace-Bench, a room-tour video benchmark for long-horizon spatial memory that covers scene perception, spatial relations, and spatial memory. We further propose LongSpace, a memory framework for long-video spatial reasoning. LongSpace models long videos as sequential chunks, incorporates 3D structural cues into early decoder layers, and constructs layer-aware memory for question-guided retrieval. Experiments on multiple spatial reasoning benchmarks show that LongSpace improves long-video spatial understanding, indicating that explicit spatial memory is a key capability for long-horizon video MLLMs.
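The chunk-then-retrieve idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes each frame is already a small feature vector, summarizes every chunk by its mean feature, and ranks chunks by cosine similarity to a question embedding. The 3D structural cues and layer-aware memory described above are omitted, and all names (`split_into_chunks`, `build_memory`, `retrieve_chunks`) are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def split_into_chunks(frames: Sequence, chunk_size: int) -> List[list]:
    """Split a frame sequence into consecutive, non-overlapping chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [list(frames[i:i + chunk_size]) for i in range(0, len(frames), chunk_size)]


def build_memory(chunks: List[List[Vector]]) -> List[Vector]:
    """Summarize each chunk by its mean feature (an assumed, simplistic memory entry)."""
    memory = []
    for chunk in chunks:
        dim = len(chunk[0])
        memory.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return memory


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_chunks(memory: List[Vector], question: Vector, top_k: int) -> List[int]:
    """Return indices of the top_k most question-relevant chunks, in temporal order."""
    ranked = sorted(range(len(memory)),
                    key=lambda i: cosine(memory[i], question),
                    reverse=True)
    return sorted(ranked[:top_k])


# Toy data: six 2-D frame features, grouped into three chunks of two frames each.
frame_features = [[1.0, 0.0], [1.0, 0.0],
                  [0.0, 1.0], [0.0, 1.0],
                  [1.0, 1.0], [1.0, 1.0]]
memory = build_memory(split_into_chunks(frame_features, chunk_size=2))
selected = retrieve_chunks(memory, question=[0.0, 1.0], top_k=2)
```

Returning the selected indices in temporal order is a design choice of this sketch, made so that retrieved chunks keep their original sequence when passed on to a model.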

