FluSplat: Sparse-View 3D Editing without Test-Time Optimization
This work addresses slow and inconsistent 3D editing for applications that require real-time or otherwise efficient manipulation. The contribution is incremental, since it builds on existing 3D Gaussian Splatting and diffusion-based editing methods.
The paper tackles computationally expensive and inconsistent 3D scene editing from sparse views by proposing a feed-forward framework that eliminates test-time optimization. The reported result is competitive editing fidelity, substantially improved cross-view consistency, and inference time reduced by orders of magnitude.
Recent advances in text-guided image editing and 3D Gaussian Splatting (3DGS) have enabled high-quality 3D scene manipulation. However, existing pipelines rely on iterative edit-and-fit optimization at test time, alternating between 2D diffusion editing and 3D reconstruction. This process is computationally expensive, scene-specific, and prone to cross-view inconsistencies. We propose a feed-forward framework for cross-view consistent 3D scene editing from sparse views. Instead of enforcing consistency through iterative 3D refinement, we introduce a cross-view regularization scheme in the image domain during training. By jointly supervising multi-view edits with geometric alignment constraints, our model produces view-consistent results without per-scene optimization at inference. The edited views are then lifted into 3D via a feed-forward 3DGS model, yielding a coherent 3DGS representation in a single forward pass. Experiments demonstrate competitive editing fidelity and substantially improved cross-view consistency compared to optimization-based methods, while reducing inference time by orders of magnitude.
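
To make the idea of an image-domain cross-view regularizer more concrete, the following is a minimal NumPy sketch of one possible consistency penalty between two edited views. It is not the paper's loss: the abstract does not give a formula, so everything here is an illustrative assumption. The function name `cross_view_consistency_loss`, the precomputed pixel correspondence map `corr_ab` (assumed to come from known scene geometry), the validity mask, and the L1 penalty are all hypothetical.

```python
import numpy as np

def cross_view_consistency_loss(edit_a, edit_b, corr_ab, valid):
    """Mean L1 difference between edited view A and edited view B,
    compared at geometrically corresponding pixels.

    Hypothetical sketch, not the paper's formulation.

    edit_a, edit_b : (H, W, 3) float arrays, the edited images of two views.
    corr_ab        : (H, W, 2) int array; corr_ab[y, x] = (y', x') is the
                     pixel in view B assumed to show the same 3D point as
                     pixel (y, x) in view A.
    valid          : (H, W) bool mask, False where no correspondence exists
                     (for example, occluded or out-of-frame points).
    """
    ys = corr_ab[..., 0]
    xs = corr_ab[..., 1]
    # Nearest-neighbour gather: view B resampled into view A's pixel grid.
    warped_b = edit_b[ys, xs]                        # (H, W, 3)
    # Per-pixel L1 difference averaged over colour channels.
    diff = np.abs(edit_a - warped_b).mean(axis=-1)   # (H, W)
    n_valid = valid.sum()
    if n_valid == 0:
        return 0.0
    return float((diff * valid).sum() / n_valid)
```

In a training setup, a term of this kind would be added to the editing objective so that the edits of different views agree wherever they observe the same surface. A differentiable version would use bilinear sampling in an autodiff framework rather than integer indexing.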