InstructMix2Mix: Consistent Sparse-View Editing Through Multi-View Model Personalization
This work addresses the challenge of coherent 3D scene editing for applications such as virtual reality and content creation, though its contribution is incremental in that it builds on existing diffusion models.
The paper tackles multi-view image editing from sparse input views, where the aim is to modify a scene according to a textual instruction while preserving consistency across views. Its experiments show that I-Mix2Mix significantly improves multi-view consistency while maintaining high per-frame edit quality.
We address the task of multi-view image editing from sparse input views, where the inputs can be seen as a mix of images capturing the scene from different viewpoints. The goal is to modify the scene according to a textual instruction while preserving consistency across all views. Existing methods, based on per-scene neural fields or temporal attention mechanisms, struggle in this setting, often producing artifacts and incoherent edits. We propose InstructMix2Mix (I-Mix2Mix), a framework that distills the editing capabilities of a 2D diffusion model into a pretrained multi-view diffusion model, leveraging its data-driven 3D prior for cross-view consistency. A key contribution is replacing the conventional neural field consolidator in Score Distillation Sampling (SDS) with a multi-view diffusion student, which requires novel adaptations: incremental student updates across timesteps, a specialized teacher noise scheduler to prevent degeneration, and an attention modification that enhances cross-view coherence without additional cost. Experiments demonstrate that I-Mix2Mix significantly improves multi-view consistency while maintaining high per-frame edit quality.
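For readers unfamiliar with Score Distillation Sampling, the following is a minimal sketch of the generic SDS update that the abstract builds on. It is not the I-Mix2Mix procedure: here the quantity being optimized is a plain list of pixel values rather than a multi-view diffusion student, and `toy_teacher_eps`, `sds_gradient`, `run_sds`, the noise-level range, and the learning rate are all hypothetical stand-ins chosen for illustration. The toy teacher is a closed-form function that assumes it knows the edited target, which a real 2D editing diffusion model would instead predict from the instruction.

```python
import math
import random


def toy_teacher_eps(x_t, alpha_bar, target):
    """Hypothetical stand-in for a 2D editing teacher (assumption, not the
    paper's model): returns the noise that would explain x_t if the clean,
    edited image were exactly `target`."""
    scale = math.sqrt(alpha_bar)
    sigma = math.sqrt(1.0 - alpha_bar)
    return [(xt - scale * tg) / sigma for xt, tg in zip(x_t, target)]


def sds_gradient(x, alpha_bar, target, rng, weight=1.0):
    """Generic SDS gradient with respect to x: weight * (eps_pred - eps)."""
    scale = math.sqrt(alpha_bar)
    sigma = math.sqrt(1.0 - alpha_bar)
    # Sample noise and form the noised input x_t at this noise level.
    eps = [rng.gauss(0.0, 1.0) for _ in x]
    x_t = [scale * xi + sigma * e for xi, e in zip(x, eps)]
    # Ask the (toy) teacher which noise it believes was added.
    eps_pred = toy_teacher_eps(x_t, alpha_bar, target)
    # The residual between predicted and true noise is the update direction.
    return [weight * (ep - e) for ep, e in zip(eps_pred, eps)]


def run_sds(x0, target, steps=200, lr=0.05, seed=0):
    """Repeatedly apply SDS gradient steps at randomly drawn noise levels."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        # Assumed noise-level range; real schedulers are model-specific.
        alpha_bar = rng.uniform(0.2, 0.9)
        grad = sds_gradient(x, alpha_bar, target, rng)
        x = [xi - lr * g for xi, g in zip(x, grad)]
    return x


if __name__ == "__main__":
    edited = run_sds([0.0, 1.0, -1.0], [0.5, 0.5, 0.5])
    print([round(v, 3) for v in edited])
```

In this toy setting the noise residual reduces to a positive multiple of `x - target`, so each step pulls the optimized values toward the teacher's notion of the edited image. In the setting the abstract describes, the optimized object would be the multi-view diffusion student, and the teacher noise scheduler and incremental updates would replace the uniform sampling used here.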