CV | Dec 11, 2020

Intrinsic Temporal Regularization for High-resolution Human Video Synthesis

arXiv:2012.06134v1
AI Analysis

This work provides an incremental improvement for researchers and practitioners working on high-resolution human video synthesis by enhancing temporal consistency.

The paper addresses the challenge of temporal consistency in high-resolution human video synthesis, where flow-based warping is unreliable because the source and target videos are misaligned and flow is hard to estimate accurately. The authors introduce an intrinsic temporal regularization scheme in which the frame generator estimates a confidence map that modulates the temporal loss and thereby regulates motion estimation. The resulting "INTERnet" generates 512x512 resolution human action videos with improved temporal coherence and realistic details.

Temporal consistency is crucial for extending image processing pipelines to the video domain, and it is often enforced with a flow-based warping error over adjacent frames. Yet for human video synthesis, such a scheme is less reliable due to the misalignment between source and target video as well as the difficulty of accurate flow estimation. In this paper, we propose an effective intrinsic temporal regularization scheme to mitigate these issues, where an intrinsic confidence map is estimated via the frame generator to regulate motion estimation via temporal loss modulation. This creates a shortcut for back-propagating temporal loss gradients directly to the front-end motion estimator, thus improving training stability and temporal coherence in output videos. We apply our intrinsic temporal regularization to a single-image generator, leading to a powerful "INTERnet" capable of generating $512\times512$ resolution human action videos with temporally coherent, realistic visual details. Extensive experiments demonstrate the superiority of the proposed INTERnet over several competitive baselines.
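
The idea of modulating a temporal loss with a confidence map can be illustrated with a minimal sketch. This is not the paper's formulation: the function name `modulated_temporal_loss`, the L1 error, the optional log penalty, and the toy 2x2 frames are all assumptions made for illustration. It only shows how a per-pixel confidence value can down-weight the warping error where the warped previous frame is unreliable.

```python
import math


def modulated_temporal_loss(current, warped_prev, confidence, penalty=0.0, eps=1e-6):
    """Confidence-weighted L1 warping error, averaged over all pixels.

    current, warped_prev, confidence: 2D lists of floats with equal shape.
    confidence values are assumed to lie in [0, 1].
    penalty (hypothetical): weight of a -log(confidence) term that would
    discourage the trivial solution of zero confidence everywhere.
    """
    total = 0.0
    count = 0
    for cur_row, prev_row, conf_row in zip(current, warped_prev, confidence):
        for cur, prev, conf in zip(cur_row, prev_row, conf_row):
            warp_error = abs(cur - prev)  # per-pixel temporal (warping) error
            total += conf * warp_error - penalty * math.log(conf + eps)
            count += 1
    return total / count


# Toy single-channel frames (illustrative values only).
current = [[1.0, 0.0], [0.5, 0.5]]
warped_prev = [[1.0, 1.0], [0.5, 0.0]]

# Uniform confidence: every pixel's warping error counts fully.
full_conf = [[1.0, 1.0], [1.0, 1.0]]
# Zero confidence at the pixel with the largest error (e.g. a bad flow estimate).
masked_conf = [[1.0, 0.0], [1.0, 1.0]]

loss_full = modulated_temporal_loss(current, warped_prev, full_conf)
loss_masked = modulated_temporal_loss(current, warped_prev, masked_conf)
print(loss_full, loss_masked)
```

In this toy case, lowering the confidence at the badly warped pixel reduces the loss, which is the sense in which a confidence map can modulate a temporal loss. In a trainable setting, some term like the hypothetical `penalty` would be needed so that confidence does not collapse to zero.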
