How I Warped Your Noise: a Temporally-Correlated Noise Prior for Diffusion Models
This work addresses video quality issues in diffusion-based methods. It is aimed at researchers and practitioners in video editing and generation, and represents an incremental improvement over existing noise sampling techniques.
The paper tackles temporal artifacts such as flickering and texture-sticking in diffusion-based video editing and generation. It proposes a novel noise representation called ∫-noise, together with a tailored transport method that preserves correlations across frames, which improves video quality for tasks such as video restoration and conditional generation.
Video editing and generation methods often rely on pre-trained image-based diffusion models. During the diffusion process, however, the reliance on rudimentary noise sampling techniques that do not preserve correlations present in subsequent frames of a video is detrimental to the quality of the results. This either produces high-frequency flickering, or texture-sticking artifacts that are not amenable to post-processing. With this in mind, we propose a novel method for preserving temporal correlations in a sequence of noise samples. This approach is materialized by a novel noise representation, dubbed $\int$-noise (integral noise), that reinterprets individual noise samples as a continuously integrated noise field: pixel values do not represent discrete values, but are rather the integral of an underlying infinite-resolution noise over the pixel area. Additionally, we propose a carefully tailored transport method that uses $\int$-noise to accurately advect noise samples over a sequence of frames, maximizing the correlation between different frames while also preserving the noise properties. Our results demonstrate that the proposed $\int$-noise can be used for a variety of tasks, such as video restoration, surrogate rendering, and conditional video generation. See https://warpyournoise.github.io/ for video results.
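To make the integral interpretation concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a discrete stand-in for the infinite-resolution field: each pixel of a unit-variance noise image is split into $k \times k$ sub-pixels, each with variance $1/k^2$, constrained so that they sum back to the original pixel value. The function names `upsample_noise` and `downsample_noise` and the factor `k` are hypothetical, chosen only for illustration. The transport (advection) step is not covered here.

```python
import numpy as np


def upsample_noise(noise, k, rng):
    """Conditionally upsample an (H, W) unit-variance noise image by factor k.

    Each coarse pixel is split into k*k sub-pixels whose values sum back to
    the coarse pixel value. Marginally, each sub-pixel is N(0, 1/k^2).
    """
    h, w = noise.shape
    # Independent sub-pixel noise with variance 1/k^2, grouped per coarse pixel.
    z = rng.standard_normal((h, k, w, k)) / k
    # Remove each block's mean so that every block sums to zero...
    z -= z.mean(axis=(1, 3), keepdims=True)
    # ...then spread the coarse value evenly over the block's k*k sub-pixels.
    z += noise[:, None, :, None] / (k * k)
    return z.reshape(h * k, w * k)


def downsample_noise(fine, k):
    """Integrate (sum) each k*k block of sub-pixels back into one coarse pixel."""
    hk, wk = fine.shape
    return fine.reshape(hk // k, k, wk // k, k).sum(axis=(1, 3))
```

Under these assumptions, `downsample_noise(upsample_noise(noise, k, rng), k)` should recover `noise` up to floating-point error, which is the discrete analogue of a pixel value being the integral of the underlying field over the pixel area.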