SpatialMem: Unified 3D Memory with Metric Anchoring and Fast Retrieval
This work addresses embodied spatial intelligence for robotics and AR/VR applications, taking an incremental, hybrid approach.
The researchers set out to build a unified 3D memory system for indoor environments from casually captured RGB video. The resulting system, SpatialMem, maintains strong navigation completion and retrieval accuracy across real-life indoor scenes under increasing clutter and occlusion.
We present SpatialMem, a memory-centric system that unifies 3D geometry, semantics, and language into a single, queryable representation. Starting from casually captured egocentric RGB video, SpatialMem reconstructs metrically scaled indoor environments, detects structural 3D anchors (walls, doors, windows) as the first-layer scaffold, and populates a hierarchical memory with open-vocabulary object nodes -- linking evidence patches, visual embeddings, and two-layer textual descriptions to 3D coordinates -- for compact storage and fast retrieval. This design enables interpretable reasoning over spatial relations (e.g., distance, direction, visibility) and supports downstream tasks such as language-guided navigation and object retrieval without specialized sensors. Experiments across three real-life indoor scenes demonstrate that SpatialMem maintains strong anchor-description-level navigation completion and hierarchical retrieval accuracy under increasing clutter and occlusion, offering an efficient and extensible framework for embodied spatial intelligence.
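To make the hierarchical layout concrete, the following is a minimal sketch of an anchor-first memory with object nodes and two simple spatial relations (distance and horizontal direction). It is not SpatialMem's implementation: the class names `Anchor`, `ObjectNode`, and `SpatialMemory`, the nearest-anchor attachment rule, and the exact-label lookup are all assumptions chosen for illustration, and the evidence patches and visual embeddings mentioned above are omitted.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    """Hypothetical object node: label, metric 3D position, two-layer text."""
    label: str
    position: tuple  # (x, y, z) in metres
    brief: str = ""   # first-layer (short) description
    detail: str = ""  # second-layer (longer) description


@dataclass
class Anchor:
    """Hypothetical structural anchor (wall, door, or window)."""
    name: str
    kind: str
    position: tuple  # (x, y, z) in metres
    objects: list = field(default_factory=list)


def distance(a, b):
    """Euclidean distance between two 3D points."""
    return math.dist(a, b)


def bearing_deg(src, dst):
    """Direction from src to dst in the x-y plane, degrees counterclockwise from +x."""
    return math.degrees(math.atan2(dst[1] - src[1], dst[0] - src[0])) % 360


class SpatialMemory:
    """Two-level memory: anchors first, objects attached beneath them."""

    def __init__(self):
        self.anchors = {}

    def add_anchor(self, anchor):
        self.anchors[anchor.name] = anchor

    def add_object(self, node):
        # Assumption: attach each object to its nearest anchor.
        nearest = min(self.anchors.values(),
                      key=lambda a: distance(a.position, node.position))
        nearest.objects.append(node)
        return nearest.name

    def find(self, label):
        # Assumption: exact label match stands in for open-vocabulary retrieval.
        return [(a.name, o) for a in self.anchors.values()
                for o in a.objects if o.label == label]


mem = SpatialMemory()
mem.add_anchor(Anchor("door_1", "door", (0.0, 0.0, 1.0)))
mem.add_anchor(Anchor("window_1", "window", (4.0, 3.0, 1.5)))
mug = ObjectNode("mug", (3.5, 2.0, 0.8), "a mug",
                 "a white ceramic mug on the desk near the window")
attached_to = mem.add_object(mug)
hits = mem.find("mug")
```

Because every node carries metric coordinates, relations such as "how far is the mug from the door" reduce to `distance(mem.anchors["door_1"].position, mug.position)`, which is what makes the reasoning interpretable in a design of this kind.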