Bidirectional Temporal Diffusion Model for Temporally Consistent Human Animation
This work addresses the problem of generating stable and artifact-free human animations for applications in graphics and AI, representing an incremental improvement over prior unidirectional methods.
The paper addresses the generation of temporally coherent human animation from a single image, a video, or random noise, targeting the motion drift and artifacts of unidirectional methods. The result is a bidirectional temporal diffusion model that shows strong performance and realistic temporal coherence compared to existing approaches.
We introduce a method to generate temporally coherent human animation from a single image, a video, or random noise. This problem has been formulated as auto-regressive generation, i.e., conditioning on past frames to decode future frames. However, such unidirectional generation is highly prone to motion drift over time, producing unrealistic human animation with significant artifacts such as appearance distortion. We claim that bidirectional temporal modeling enforces temporal coherence on a generative network by largely suppressing the motion ambiguity of human appearance. To support this claim, we design a novel human animation framework using a denoising diffusion model: a neural network learns to generate the image of a person by denoising temporal Gaussian noise, with intermediate results cross-conditioned bidirectionally between consecutive frames. In our experiments, the method demonstrates strong performance and realistic temporal coherence compared to existing unidirectional approaches.
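The idea of cross-conditioning intermediate results in both temporal directions can be illustrated with a toy sketch. It is not the model described above: there is no learned network and no noise schedule, and the names `toy_denoiser`, `bidirectional_denoise`, and `temporal_roughness`, as well as the alternating forward/backward schedule, are assumptions made purely for illustration. Each "frame" is a short list of numbers, and the stand-in denoiser simply pulls a frame toward its conditioning neighbor.

```python
import random


def toy_denoiser(frame, neighbor, strength):
    """Hypothetical stand-in for a learned denoiser (an assumption, not the
    paper's network): blend each value of `frame` toward `neighbor`."""
    return [(1.0 - strength) * f + strength * n for f, n in zip(frame, neighbor)]


def bidirectional_denoise(frames, num_steps=10, strength=0.5):
    """Alternate forward and backward passes over the frame sequence.

    Even steps condition frame i on the intermediate result of frame i-1
    (forward in time); odd steps condition frame i on frame i+1 (backward).
    The input list is not modified.
    """
    frames = [list(f) for f in frames]  # work on a copy
    n = len(frames)
    for step in range(num_steps):
        if step % 2 == 0:
            # Forward pass: past frame conditions the next frame.
            for i in range(1, n):
                frames[i] = toy_denoiser(frames[i], frames[i - 1], strength)
        else:
            # Backward pass: future frame conditions the previous frame.
            for i in range(n - 2, -1, -1):
                frames[i] = toy_denoiser(frames[i], frames[i + 1], strength)
    return frames


def temporal_roughness(frames):
    """Mean absolute difference between consecutive frames."""
    total, count = 0.0, 0
    for a, b in zip(frames, frames[1:]):
        for x, y in zip(a, b):
            total += abs(x - y)
            count += 1
    return total / count


# Six "frames" of four values each, initialized as Gaussian noise.
rng = random.Random(0)
noisy = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(6)]
smoothed = bidirectional_denoise(noisy)
```

In this toy setting, comparing `temporal_roughness(noisy)` with `temporal_roughness(smoothed)` shows that the frame-to-frame differences shrink as information propagates in both directions. A real system would replace `toy_denoiser` with a trained diffusion network and a proper noise schedule.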