CV, AI, LG | Jun 15, 2024

GenMM: Geometrically and Temporally Consistent Multimodal Data Generation for Video and LiDAR

arXiv:2406.10722v1 | 9 citations
Originality: Incremental advance
AI Analysis

GenMM addresses the need for realistic synthetic data in domains such as autonomous driving and robotics, though it appears incremental because it builds on existing techniques such as diffusion models and depth estimation.

The paper tackles the problem of generating multimodal synthetic data by proposing GenMM, a method for inserting 3D objects into RGB videos and LiDAR scans with temporal and geometric consistency, achieving effective results in experiments.

Multimodal synthetic data generation is crucial in domains such as autonomous driving, robotics, augmented/virtual reality, and retail. We propose a novel approach, GenMM, for jointly editing RGB videos and LiDAR scans by inserting temporally and geometrically consistent 3D objects. Our method uses a reference image and 3D bounding boxes to seamlessly insert and blend new objects into target videos. We inpaint the 2D Regions of Interest (consistent with 3D boxes) using a diffusion-based video inpainting model. We then compute semantic boundaries of the object and estimate its surface depth using state-of-the-art semantic segmentation and monocular depth estimation techniques. Subsequently, we employ a geometry-based optimization algorithm to recover the 3D shape of the object's surface, ensuring it fits precisely within the 3D bounding box. Finally, LiDAR rays intersecting with the new object surface are updated to reflect consistent depths with its geometry. Our experiments demonstrate the effectiveness of GenMM in inserting various 3D objects across video and LiDAR modalities.
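
The final step of the abstract (updating LiDAR returns that hit the inserted object) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes, for simplicity, that the LiDAR points are already expressed in the camera frame, that the LiDAR origin coincides with the camera center, and that a per-pixel object mask and metric object depth map are available from the earlier steps. The function name `update_lidar_points` and all parameter names are hypothetical.

```python
import numpy as np

def update_lidar_points(points_cam, K, obj_mask, obj_depth):
    """Pull LiDAR points onto an inserted object's surface (illustrative only).

    points_cam: (N, 3) LiDAR points in the camera frame (assumed co-located sensors).
    K:          (3, 3) pinhole camera intrinsics.
    obj_mask:   (H, W) boolean mask of the inserted object.
    obj_depth:  (H, W) metric depth (z) of the object surface per pixel.
    Returns the updated points and the indices of the points that changed.
    """
    pts = np.asarray(points_cam, dtype=float).copy()
    mask = np.asarray(obj_mask, dtype=bool)
    z = pts[:, 2]
    in_front = z > 0

    # Project every point into the image; points behind the camera are masked out later.
    uvw = (K @ pts.T).T
    safe_z = np.where(in_front, z, 1.0)
    u = np.round(uvw[:, 0] / safe_z).astype(int)
    v = np.round(uvw[:, 1] / safe_z).astype(int)

    h, w = mask.shape
    inside = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    idx = np.flatnonzero(inside)

    # A ray is occluded when it passes through the mask and the object is closer.
    hit = mask[v[idx], u[idx]] & (obj_depth[v[idx], u[idx]] < z[idx])
    idx = idx[hit]

    # Shorten the ray so that its depth matches the object surface.
    scale = obj_depth[v[idx], u[idx]] / z[idx]
    pts[idx] *= scale[:, None]
    return pts, idx
```

In this sketch, points that lie in front of the object surface, fall outside the mask, or sit behind the camera are left untouched, so only rays that the new object would actually occlude are shortened. A real sensor setup would also need the LiDAR-to-camera extrinsics, which are omitted here.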
