Zero-shot Synthetic Video Realism Enhancement via Structure-aware Denoising
This work addresses the challenge of making synthetic videos more photorealistic for applications such as simulation and content creation; it represents an incremental improvement over existing methods.
The paper proposes a zero-shot framework for enhancing synthetic video realism that preserves the multi-level structures of the synthetic video in both the spatial and temporal domains. It outperforms existing baselines in structural consistency while maintaining state-of-the-art photorealism quality.
We propose an approach to enhancing synthetic video realism that re-renders synthetic videos from a simulator in a photorealistic fashion. It is a zero-shot framework, built upon a diffusion-based video foundation model without further fine-tuning, that preserves the multi-level structures of the synthetic video in the enhanced video across both the spatial and temporal domains. Specifically, we modify the generation/denoising process so that it is conditioned on structure-aware information from the synthetic video, such as depth maps, semantic maps, and edge maps. This information is estimated by an auxiliary model rather than extracted from the simulator. The guidance keeps the enhanced videos consistent with the original synthetic video at both the structural and semantic levels. The approach is simple yet general and powerful: in our experiments, it outperforms existing baselines in structural consistency with the original video while maintaining state-of-the-art photorealism quality.
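As a rough illustration of what structural consistency with the original video can mean, the sketch below estimates per-frame edge maps from two videos and compares them. It is not the method or metric used in this work. The Sobel filter is an assumed stand-in for the auxiliary structure estimator, and `sobel_edges`, `structure_maps`, and `edge_consistency` are hypothetical names introduced only for this example.

```python
import numpy as np

def sobel_edges(frame: np.ndarray) -> np.ndarray:
    """Gradient-magnitude edge map of one grayscale frame of shape (H, W)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
    ky = kx.T
    # Replicate border pixels so the output keeps the input resolution.
    padded = np.pad(frame.astype(np.float64), 1, mode="edge")
    h, w = frame.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for i in range(3):
        for j in range(3):
            window = padded[i:i + h, j:j + w]
            gx += kx[i, j] * window
            gy += ky[i, j] * window
    return np.hypot(gx, gy)

def structure_maps(video: np.ndarray) -> np.ndarray:
    """Per-frame edge maps for a video of shape (T, H, W)."""
    return np.stack([sobel_edges(frame) for frame in video])

def edge_consistency(synthetic: np.ndarray, enhanced: np.ndarray) -> float:
    """Normalised mean absolute edge-map difference; 0.0 means identical edges."""
    a = structure_maps(synthetic)
    b = structure_maps(enhanced)
    scale = max(a.max(), b.max(), 1e-8)  # guard against all-flat videos
    return float(np.mean(np.abs(a - b)) / scale)

# Toy data (assumption): a bright square drifting diagonally over 4 frames.
synthetic = np.zeros((4, 16, 16))
for t in range(4):
    synthetic[t, 2 + t:6 + t, 2 + t:6 + t] = 1.0

restyled = 0.8 * synthetic + 0.1          # appearance changes, layout kept
shifted = np.roll(synthetic, 5, axis=2)   # layout itself is displaced

score_restyled = edge_consistency(synthetic, restyled)
score_shifted = edge_consistency(synthetic, shifted)
```

On this toy data the restyled video scores lower (more consistent) than the shifted one, since only its contrast changes while the edges stay in place. A learned depth, semantic, or edge estimator could replace `sobel_edges` without changing the rest of the comparison.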