Joint 2D-3D Segmentation and Association in Street-level Imaging
This work improves object segmentation and identity retention for large-scale urban mapping and Spatial Digital Twin creation.
The paper introduces a unified framework for joint 2D-3D segmentation and association in street-level imagery, using zero-shot detection and structure-from-motion to establish cross-view correspondences. It achieves a 22% performance gain over 2D-only tracking methods in challenging urban scenarios.
Accurate interpretation of street-level imagery is essential for large-scale urban mapping and the creation of Spatial Digital Twin (SDT) environments. This work presents a unified framework for joint 2D-3D segmentation and association that integrates visual semantics with multi-view geometric reasoning. Unlike conventional approaches that rely heavily on sequential frames for temporal tracking, our method leverages zero-shot detection and segmentation together with structure-from-motion reconstruction to establish stable cross-view correspondences. A 3D-driven association mechanism replaces traditional 2D multi-object tracking, using geometric consistency to guide identity preservation across wide-baseline viewpoints and varying imaging conditions. By combining 2D texture cues with global 3D context, the proposed pipeline is well-suited for scalable street-level processing and can be used for a variety of object types. Experiments demonstrate substantially improved coverage of ground-truth sequences and more robust identity retention compared to state-of-the-art 2D-only tracking methods, achieving a 22% performance gain in challenging urban scenarios.
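To illustrate how geometric consistency can drive cross-view association, the following minimal Python sketch groups 2D detections from different images when their masks observe many of the same reconstructed 3D points. It is not the paper's implementation. The input layout (a set of structure-from-motion point IDs per mask), the Jaccard-overlap criterion, the `min_overlap` threshold, and the function name `associate_by_shared_points` are all assumptions made for this example.

```python
from itertools import combinations


def associate_by_shared_points(detections, min_overlap=0.3):
    """Group 2D detections into object identities via shared 3D points.

    detections: dict mapping (image_id, detection_id) -> set of 3D point IDs
        whose 2D observations fall inside that detection's mask
        (hypothetical input format, assumed to come from an SfM reconstruction).
    min_overlap: assumed Jaccard threshold for linking two detections.
    Returns a dict mapping each detection key to an integer identity.
    """
    keys = sorted(detections)
    parent = {k: k for k in keys}

    def find(k):
        # Union-find root lookup with path halving.
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    for a, b in combinations(keys, 2):
        if a[0] == b[0]:
            # Two detections in the same image are never linked directly.
            continue
        points_a, points_b = detections[a], detections[b]
        union = points_a | points_b
        if not union:
            continue
        # Jaccard overlap of the 3D point sets seen through each mask.
        if len(points_a & points_b) / len(union) >= min_overlap:
            parent[find(a)] = find(b)

    # Assign compact integer identities in sorted key order.
    identity_of_root = {}
    return {k: identity_of_root.setdefault(find(k), len(identity_of_root))
            for k in keys}


if __name__ == "__main__":
    example = {
        ("img_a", 0): {1, 2, 3, 4},
        ("img_b", 0): {2, 3, 4, 5},
        ("img_c", 0): {3, 4, 5, 6},
        ("img_a", 1): {10, 11},
        ("img_c", 1): {10, 11, 12},
    }
    for key, identity in associate_by_shared_points(example).items():
        print(key, "->", identity)
```

Because identities are derived from shared 3D structure rather than from frame-to-frame motion, this kind of grouping does not depend on the images being sequential, which is the property the abstract attributes to 3D-driven association. A practical system would also need to handle transitive links that merge two detections from the same image, which this sketch does not attempt.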