CV | Jun 13, 2025

Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation

Min-Seop Kwak, Junho Kim, Sangdoo Yun, Dongyoon Han, Taekyoung Kim, Seungryong Kim, Jin-Hwa Kim

arXiv:2506.11924v2 | 11.84 citations | h-index: 29

Originality: Incremental advance

AI Analysis

This paper addresses 3D scene completion and novel view synthesis for computer vision applications. It represents a novel hybrid approach rather than a foundational breakthrough.

The paper generates aligned novel view images and geometry from sparse inputs by introducing a diffusion-based framework that uses cross-modal attention instillation: attention maps from the image branch are injected into a parallel geometry branch. The method achieves high-fidelity extrapolative view synthesis across unseen scenes and delivers competitive reconstruction quality in interpolation settings.
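The attention sharing between the image and geometry branches described above can be illustrated with a minimal NumPy sketch. It is a toy single-head example under stated assumptions, not the authors' implementation: the function names (`attention_map`, `shared_attention`), the token counts, and the feature sizes are all hypothetical, and learned projections and the diffusion networks themselves are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_map(q, k):
    # Scaled dot-product attention weights, shape (n_tokens, n_tokens).
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d), axis=-1)

def shared_attention(q_img, k_img, v_img, v_geo):
    # Assumption: the attention map is computed only in the image branch,
    # then reused as-is to aggregate the geometry branch's value tokens.
    attn = attention_map(q_img, k_img)
    out_img = attn @ v_img   # image branch output
    out_geo = attn @ v_geo   # geometry branch output, driven by image attention
    return out_img, out_geo, attn

rng = np.random.default_rng(0)
n_tokens, d_img, d_geo = 6, 4, 3   # hypothetical sizes
q_img = rng.standard_normal((n_tokens, d_img))
k_img = rng.standard_normal((n_tokens, d_img))
v_img = rng.standard_normal((n_tokens, d_img))
v_geo = rng.standard_normal((n_tokens, d_geo))

out_img, out_geo, attn = shared_attention(q_img, k_img, v_img, v_geo)
```

Because both outputs are weighted by the same map, each geometry token aggregates information from the same spatial locations as its image counterpart, which is the intuition behind the alignment claim.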

We introduce a diffusion-based framework that performs aligned novel view image and geometry generation via a warping-and-inpainting methodology. Unlike prior methods that require dense posed images or pose-embedded generative models limited to in-domain views, our method leverages off-the-shelf geometry predictors to predict partial geometries viewed from reference images, and formulates novel-view synthesis as an inpainting task for both image and geometry. To ensure accurate alignment between generated images and geometry, we propose cross-modal attention instillation, where attention maps from the image diffusion branch are injected into a parallel geometry diffusion branch during both training and inference. This multi-task approach achieves synergistic effects, facilitating geometrically robust image synthesis as well as well-defined geometry prediction. We further introduce proximity-based mesh conditioning to integrate depth and normal cues, interpolating between points of the point cloud and preventing erroneously predicted geometry from influencing the generation process. Empirically, our method achieves high-fidelity extrapolative view synthesis on both image and geometry across a range of unseen scenes, delivers competitive reconstruction quality under interpolation settings, and produces geometrically aligned colored point clouds for comprehensive 3D completion. The project page is available at https://cvlab-kaist.github.io/MoAI.
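The warping half of a warping-and-inpainting pipeline can be sketched as follows. This is a simplified illustration under stated assumptions, not the method from the paper: it forward-warps a reference depth map as individual points with a z-buffer, whereas the abstract describes proximity-based mesh conditioning. The pinhole intrinsics `K`, the relative pose `(R, t)`, and the function name `warp_depth_to_target` are hypothetical. The returned hole mask marks the pixels an inpainting model would need to fill.

```python
import numpy as np

def warp_depth_to_target(depth_ref, K, R, t):
    """Forward-warp a reference depth map into a target view.

    Returns (depth_tgt, hole_mask); hole_mask is True where no point landed.
    """
    h, w = depth_ref.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u.ravel(), v.ravel(), np.ones(h * w)], axis=0)  # 3 x N

    # Unproject pixels to 3D points in the reference camera frame.
    pts_ref = (np.linalg.inv(K) @ pix) * depth_ref.ravel()
    # Move the points into the target camera frame.
    pts_tgt = R @ pts_ref + t.reshape(3, 1)

    z = pts_tgt[2]
    front = z > 1e-6                      # keep points in front of the camera
    proj = K @ pts_tgt[:, front]
    z = z[front]
    u_t = np.round(proj[0] / z).astype(int)
    v_t = np.round(proj[1] / z).astype(int)

    inside = (u_t >= 0) & (u_t < w) & (v_t >= 0) & (v_t < h)
    u_t, v_t, z = u_t[inside], v_t[inside], z[inside]

    # Z-buffer: when several points hit one pixel, keep the nearest.
    depth_tgt = np.full((h, w), np.inf)
    np.minimum.at(depth_tgt, (v_t, u_t), z)

    hole_mask = np.isinf(depth_tgt)
    depth_tgt[hole_mask] = 0.0
    return depth_tgt, hole_mask

# Toy setup: a flat wall at depth 1 seen by an 8x8 camera.
K = np.array([[4.0, 0.0, 3.5],
              [0.0, 4.0, 3.5],
              [0.0, 0.0, 1.0]])
depth_ref = np.ones((8, 8))
# Points shift by +0.5 along x in the target frame (2 pixels at this depth).
depth_tgt, hole_mask = warp_depth_to_target(
    depth_ref, K, np.eye(3), np.array([0.5, 0.0, 0.0])
)
```

In this toy setup the warped content shifts two pixels to the right, so the two leftmost columns of `hole_mask` are the disoccluded region left for inpainting.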
