VMDiff: Visual Mixing Diffusion for Limitless Cross-Object Synthesis
This work addresses a fundamental challenge in image-to-image generation, with applications in artistic creation, virtual reality, and visual media. It represents an incremental improvement over existing methods.
The paper tackles the problem of fusing visual cues from multiple images to create novel, coherent objects. It targets two failure modes, coexistent generation and bias generation, and reports superior performance in visual quality, semantic consistency, and creativity on a benchmark of 780 concept pairs.
Creating novel images by fusing visual cues from multiple sources is a fundamental yet underexplored problem in image-to-image generation, with broad applications in artistic creation, virtual reality and visual media. Existing methods often face two key challenges: coexistent generation, where multiple objects are simply juxtaposed without true integration, and bias generation, where one object dominates the output due to semantic imbalance. To address these issues, we propose Visual Mixing Diffusion (VMDiff), a simple yet effective diffusion-based framework that synthesizes a single, coherent object by integrating two input images at both noise and latent levels. Our approach comprises: (1) a hybrid sampling process that combines guided denoising, inversion, and spherical interpolation with adjustable parameters to achieve structure-aware fusion, mitigating coexistent generation; and (2) an efficient adaptive adjustment module, which introduces a novel similarity-based score to automatically and adaptively search for optimal parameters, countering semantic bias. Experiments on a curated benchmark of 780 concept pairs demonstrate that our method outperforms strong baselines in visual quality, semantic consistency, and human-rated creativity.
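The spherical interpolation mentioned in the hybrid sampling process can be illustrated with a minimal sketch. This is not the VMDiff implementation: the function name `slerp`, the weight `t`, and the use of plain Python lists in place of noise or latent tensors are assumptions made for illustration, and the abstract does not say how the interpolation is combined with guided denoising and inversion. The sketch shows only the standard spherical linear interpolation formula, which blends two vectors along the arc between them rather than along a straight line.

```python
import math
from typing import List, Sequence


def slerp(a: Sequence[float], b: Sequence[float], t: float,
          eps: float = 1e-7) -> List[float]:
    """Spherical linear interpolation between vectors a and b.

    t = 0 returns a, t = 1 returns b. The weight t is a hypothetical
    stand-in for one of the adjustable parameters the abstract mentions.
    """
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))

    # A zero vector has no direction, so fall back to linear interpolation.
    if norm_a < eps or norm_b < eps:
        return [(1 - t) * x + t * y for x, y in zip(a, b)]

    # Cosine of the angle between a and b, clamped against rounding error.
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)

    # Nearly parallel or antiparallel vectors: the slerp weights become
    # numerically unstable, so use linear interpolation instead.
    if sin_omega < eps:
        return [(1 - t) * x + t * y for x, y in zip(a, b)]

    weight_a = math.sin((1 - t) * omega) / sin_omega
    weight_b = math.sin(t * omega) / sin_omega
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]


if __name__ == "__main__":
    # Halfway between two orthogonal unit vectors.
    print(slerp([1.0, 0.0], [0.0, 1.0], 0.5))
```

For two vectors of equal norm, the interpolated vector keeps that norm, whereas a straight linear blend would shrink it. This norm-preserving property is a common reason for choosing spherical over linear interpolation when mixing Gaussian noise samples in diffusion models.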