CV AI, May 27, 2025

Any-to-Bokeh: Arbitrary-Subject Video Refocusing with Video Diffusion Model

Yang Yang, Siming Zheng, Qirui Yang, Jinwei Chen, Boxi Wu, Xiaofei He, Deng Cai, Bo Li, Peng-Tao Jiang

arXiv:2505.21593v38.42 citations, h-index: 82

Originality Highly original

AI Analysis

This work addresses the challenge of producing stable and adjustable depth-of-field effects in videos for applications like video editing and simulation, representing an incremental improvement with a novel method for a known bottleneck.

The paper tackles the problem of generating temporally coherent and controllable video bokeh effects, a setting in which existing methods suffer from flickering and inconsistent blur. It proposes a one-step diffusion framework that uses a multi-plane image representation and progressive training, and it reports superior temporal coherence, spatial accuracy, and controllability on synthetic and real-world benchmarks.

Diffusion models have recently emerged as powerful tools for camera simulation, enabling both geometric transformations and realistic optical effects. Among these, image-based bokeh rendering has shown promising results, but diffusion for video bokeh remains unexplored. Existing image-based methods are plagued by temporal flickering and inconsistent blur transitions, while current video editing methods lack explicit control over the focus plane and bokeh intensity. These issues limit their applicability for controllable video bokeh. In this work, we propose a one-step diffusion framework for generating temporally coherent, depth-aware video bokeh rendering. The framework employs a multi-plane image (MPI) representation adapted to the focal plane to condition the video diffusion model, thereby enabling it to exploit strong 3D priors from pretrained backbones. To further enhance temporal stability, depth robustness, and detail preservation, we introduce a progressive training strategy. Experiments on synthetic and real-world benchmarks demonstrate superior temporal coherence, spatial accuracy, and controllability, outperforming prior baselines. This work represents the first dedicated diffusion framework for video bokeh generation, establishing a new baseline for temporally coherent and controllable depth-of-field effects.
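
The abstract does not spell out how the focal-plane-adapted MPI is built, so the following is only a rough illustration of the general idea of layering a scene by distance from the focus plane, not the authors' construction. It assumes a disparity map normalized to [0, 1], a focus value in the same range, and uniform bins of |disparity - focus|; the function name `focal_plane_layers` and all of its parameters are hypothetical.

```python
from typing import List

Mask = List[List[int]]


def focal_plane_layers(
    disparity: List[List[float]], focus: float, num_planes: int = 4
) -> List[Mask]:
    """Split a disparity map into binary masks ordered by distance from the focal plane.

    Assumptions (illustrative, not taken from the paper): disparity values and
    `focus` lie in [0, 1], and planes are uniform bins of |disparity - focus|.
    """
    if num_planes < 1:
        raise ValueError("num_planes must be at least 1")
    if not 0.0 <= focus <= 1.0:
        raise ValueError("focus must lie in [0, 1]")

    # Largest distance any pixel can have from the focal plane; at least 0.5.
    max_dist = max(focus, 1.0 - focus)

    masks = [[[0] * len(row) for row in disparity] for _ in range(num_planes)]
    for y, row in enumerate(disparity):
        for x, d in enumerate(row):
            dist = abs(d - focus)
            # Plane 0 holds pixels nearest the focal plane; the last plane
            # holds the pixels farthest from it.
            idx = min(int(dist / max_dist * num_planes), num_planes - 1)
            masks[idx][y][x] = 1
    return masks


if __name__ == "__main__":
    toy_disparity = [[0.5, 0.55], [0.0, 1.0]]
    for i, mask in enumerate(focal_plane_layers(toy_disparity, focus=0.5)):
        print(i, mask)
```

Each pixel lands in exactly one layer, so the in-focus layer can be left sharp while layers farther from the focal plane receive progressively stronger blur. Moving `focus` re-sorts the layers, which is the kind of explicit focus-plane control the abstract says current video editing methods lack.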

