JointEdit3D: Feed-Forward 3D Scene Editing in a Unified Latent Space
This work targets the high test-time cost and structural inconsistencies of existing 3D scene editing methods, offering computer vision researchers a feed-forward alternative to per-scene optimization.
JointEdit3D is a feed-forward 3D scene editing method that operates in a unified latent space for RGB and geometry. It improves edited-region quality and 3D structural completeness over prior baselines while maintaining competitive background preservation. The method is evaluated on a new dataset of 15K paired editing samples.
Existing 3D scene editing methods typically rely on per-scene optimization over explicit 3D representations or cascaded edit-and-reconstruct pipelines, resulting in high test-time cost, limited 3D awareness, and structural inconsistencies. To couple appearance synthesis and geometry prediction during editing, we build on a unified RGB-geometry reconstruction-generation latent space and adapt it to feed-forward 3D scene editing. The resulting framework, JointEdit3D, performs asymmetric latent inpainting by observing only a single edited RGB reference latent and generating the remaining RGB views and edited geometry latent under source-scene anchoring. JointEdit3D introduces a dedicated SceneAnchor Branch to inject source-scene structure without forcing direct copying, and adopts edit/background-aware losses to balance edited-region fidelity with unedited-content preservation. To address the lack of paired resources for standardized 3D scene editing evaluation, we introduce SceneEdit3D-15K, a dataset with 15K paired editing samples and renderer-provided 3D annotations, together with SceneEdit3D-Bench, a curated 100-sample benchmark. Experiments show that JointEdit3D improves edited-region quality and 3D structural completeness over prior baselines while maintaining competitive background preservation.
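As an illustrative sketch only, the two ideas named in the abstract (asymmetric latent inpainting and edit/background-aware losses) might be pictured as follows. The abstract does not give the actual formulation, so everything here is an assumption: the function names `observation_mask` and `region_weighted_mse`, the per-region mean-squared-error form, and the weights `w_edit` and `w_bg` are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence, Union


def observation_mask(num_views: int, ref_view: int = 0) -> Dict[str, Union[List[bool], bool]]:
    """Flag which latents are observed (True) versus generated (False).

    Hypothetical layout: one RGB latent per view plus one geometry latent.
    Only the edited RGB reference view is observed; the other RGB views
    and the geometry latent are left for the model to generate.
    """
    if not 0 <= ref_view < num_views:
        raise ValueError("ref_view must index one of the views")
    return {
        "rgb": [v == ref_view for v in range(num_views)],
        "geometry": False,
    }


def region_weighted_mse(
    pred: Sequence[float],
    target: Sequence[float],
    edit_mask: Sequence[bool],
    w_edit: float = 2.0,  # assumed weight on the edited region
    w_bg: float = 1.0,    # assumed weight on the unedited background
) -> float:
    """Weighted sum of per-region mean squared errors (an assumed loss form)."""
    edit_err: List[float] = []
    bg_err: List[float] = []
    for p, t, m in zip(pred, target, edit_mask):
        # Route each element's squared error to its region.
        (edit_err if m else bg_err).append((p - t) ** 2)

    def mean(xs: List[float]) -> float:
        # An empty region contributes zero rather than dividing by zero.
        return sum(xs) / len(xs) if xs else 0.0

    return w_edit * mean(edit_err) + w_bg * mean(bg_err)


if __name__ == "__main__":
    print(observation_mask(num_views=3))
    print(region_weighted_mse([1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 2.0],
                              [True, True, False, False]))
```

Averaging each region separately before weighting keeps a small edited region from being swamped by a large background, which is one plausible way to trade off edit fidelity against preservation.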