CV, RO | Jun 26, 2025

SAM4D: Segment Anything in Camera and LiDAR Streams

arXiv:2506.21547v1 | 9 citations | h-index: 10
Originality: Incremental advance
AI Analysis

The paper addresses efficient and consistent multi-modal segmentation for autonomous driving systems. The advance appears incremental because it builds on existing foundation models.

The paper tackles promptable segmentation across camera and LiDAR streams in autonomous driving. It introduces SAM4D, which combines unified multi-modal alignment with motion-aware temporal consistency. Experiments on the constructed Waymo-4DSeg are reported to show strong cross-modal segmentation ability.

We present SAM4D, a multi-modal and temporal foundation model designed for promptable segmentation across camera and LiDAR streams. Unified Multi-modal Positional Encoding (UMPE) is introduced to align camera and LiDAR features in a shared 3D space, enabling seamless cross-modal prompting and interaction. Additionally, we propose Motion-aware Cross-modal Memory Attention (MCMA), which leverages ego-motion compensation to enhance temporal consistency and long-horizon feature retrieval, ensuring robust segmentation across dynamically changing autonomous driving scenes. To avoid annotation bottlenecks, we develop a multi-modal automated data engine that synergizes VFM-driven video masklets, spatiotemporal 4D reconstruction, and cross-modal masklet fusion. This framework generates camera-LiDAR aligned pseudo-labels at a speed orders of magnitude faster than human annotation while preserving VFM-derived semantic fidelity in point cloud representations. We conduct extensive experiments on the constructed Waymo-4DSeg, which demonstrate the powerful cross-modal segmentation ability and great potential in data annotation of the proposed SAM4D.
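
The abstract mentions ego-motion compensation without giving details. As a rough illustration of the underlying geometric step only, the sketch below maps 3D points from a previous ego frame into the current ego frame using two vehicle poses. It is not the paper's MCMA module, which operates on features inside memory attention. All function names here (`invert_rigid`, `compensate_ego_motion`, `yaw_pose`) are hypothetical, and the poses are assumed to be 4x4 ego-to-world rigid transforms with x forward and y left.

```python
import math

def invert_rigid(pose):
    """Invert a 4x4 rigid transform [R | t]: the inverse is [R^T | -R^T t]."""
    r_t = [[pose[j][i] for j in range(3)] for i in range(3)]  # transpose of R
    t = [pose[i][3] for i in range(3)]
    t_inv = [-sum(r_t[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [t_inv[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def matmul4(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transform_point(m, p):
    """Apply a 4x4 rigid transform to a 3D point."""
    x, y, z = p
    return tuple(m[i][0] * x + m[i][1] * y + m[i][2] * z + m[i][3]
                 for i in range(3))

def compensate_ego_motion(points_prev, pose_prev, pose_curr):
    """Express points from the previous ego frame in the current ego frame.

    Assumption: pose_prev and pose_curr are ego-to-world transforms.
    """
    prev_to_curr = matmul4(invert_rigid(pose_curr), pose_prev)
    return [transform_point(prev_to_curr, p) for p in points_prev]

def yaw_pose(yaw, tx, ty):
    """Hypothetical helper: planar ego-to-world pose from yaw and x/y offset."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, tx],
            [s, c, 0.0, ty],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]

# A static point 10 m ahead; the vehicle then drives 2 m forward.
moved = compensate_ego_motion([(10.0, 0.0, 0.0)],
                              yaw_pose(0.0, 0.0, 0.0),
                              yaw_pose(0.0, 2.0, 0.0))
print(moved[0])  # the point is now 8 m ahead in the current frame
```

With this kind of alignment, a static object keeps the same coordinates across frames once past observations are re-expressed in the current frame, which is the intuition behind using ego-motion compensation for temporal consistency.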

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
