CV, AI | May 27, 2025

Any-to-Bokeh: Arbitrary-Subject Video Refocusing with Video Diffusion Model

arXiv:2505.21593v3 | 2 citations | h-index: 82
Originality: Highly original
AI Analysis

This work addresses the challenge of producing stable, adjustable depth-of-field effects in videos for applications such as video editing and simulation. It is an incremental improvement that applies a novel method to a known bottleneck.

The paper tackles temporally coherent, controllable video bokeh generation, where existing methods suffer from flickering and inconsistent blur. It proposes a one-step diffusion framework that uses a multi-plane image representation and progressive training, and reports superior temporal coherence, spatial accuracy, and controllability on benchmarks.

Diffusion models have recently emerged as powerful tools for camera simulation, enabling both geometric transformations and realistic optical effects. Among these, image-based bokeh rendering has shown promising results, but diffusion for video bokeh remains unexplored. Existing image-based methods are plagued by temporal flickering and inconsistent blur transitions, while current video editing methods lack explicit control over the focus plane and bokeh intensity. These issues limit their applicability for controllable video bokeh. In this work, we propose a one-step diffusion framework for generating temporally coherent, depth-aware video bokeh rendering. The framework employs a multi-plane image (MPI) representation adapted to the focal plane to condition the video diffusion model, thereby enabling it to exploit strong 3D priors from pretrained backbones. To further enhance temporal stability, depth robustness, and detail preservation, we introduce a progressive training strategy. Experiments on synthetic and real-world benchmarks demonstrate superior temporal coherence, spatial accuracy, and controllability, outperforming prior baselines. This work represents the first dedicated diffusion framework for video bokeh generation, establishing a new baseline for temporally coherent and controllable depth-of-field effects.
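
To make the idea of a focal-plane-adapted MPI more concrete, here is a minimal sketch, not the paper's implementation. The abstract does not specify how the planes are built, so the uniform binning, the function names `focal_plane_masks` and `blur_radius`, and the parameters `num_planes` and `blur_strength` are all illustrative assumptions. The sketch splits a disparity map into layers by distance from the chosen focus disparity, and computes the commonly used per-pixel blur radius that grows linearly with that distance.

```python
import numpy as np


def focal_plane_masks(disparity, focus_disparity, num_planes=4):
    """Split a disparity map into binary layers by distance from the focal plane.

    Hypothetical helper: uniform binning is an assumption, not the paper's scheme.
    Layer 0 holds the pixels closest to the focal plane (sharpest region).
    """
    dist = np.abs(disparity - focus_disparity)
    max_dist = max(float(dist.max()), 1e-8)  # avoid a degenerate zero-width range
    edges = np.linspace(0.0, max_dist, num_planes + 1)
    # Using only the inner edges yields layer indices in [0, num_planes - 1].
    idx = np.digitize(dist, edges[1:-1])
    return np.stack([idx == k for k in range(num_planes)]).astype(np.float32)


def blur_radius(disparity, focus_disparity, blur_strength):
    """Per-pixel blur radius: zero on the focal plane, growing linearly away from it."""
    return blur_strength * np.abs(disparity - focus_disparity)


if __name__ == "__main__":
    disparity = np.array([[0.0, 0.5], [1.0, 0.25]])
    masks = focal_plane_masks(disparity, focus_disparity=0.0, num_planes=4)
    print(masks.shape)                       # one mask per plane
    print(blur_radius(disparity, 0.0, 10.0))  # larger values mean stronger blur
```

Under these assumptions, changing `focus_disparity` moves the sharp layer to a different subject, and scaling `blur_strength` changes bokeh intensity, which are the two controls the abstract says existing video editing methods lack.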
