C2F-FWN: Coarse-to-Fine Flow Warping Network for Spatial-Temporal Consistent Motion Transfer
This work provides an incremental improvement in human video motion transfer for video synthesis researchers and practitioners, specifically addressing issues of spatial and temporal consistency.
This paper addresses human video motion transfer (HVMT) by proposing C2F-FWN, a Coarse-to-Fine Flow Warping Network. It improves spatial consistency by using coarse-to-fine flow warping and Layout-Constrained Deformable Convolution, and enhances temporal consistency with Flow Temporal Consistency Loss. Experiments on the SoloDance and iPER datasets show it outperforms state-of-the-art methods in both spatial and temporal consistency.
Human video motion transfer (HVMT) aims to synthesize videos in which one person imitates another person's actions. Although existing GAN-based HVMT methods have achieved great success, they either fail to preserve appearance details because spatial consistency between synthesized and exemplary images is lost, or generate incoherent videos because temporal consistency among frames is lacking. In this paper, we propose the Coarse-to-Fine Flow Warping Network (C2F-FWN) for spatial-temporal consistent HVMT. In particular, C2F-FWN utilizes coarse-to-fine flow warping and Layout-Constrained Deformable Convolution (LC-DConv) to improve spatial consistency, and employs a Flow Temporal Consistency (FTC) Loss to enhance temporal consistency. In addition, given multi-source appearance inputs, C2F-FWN supports appearance attribute editing with great flexibility and efficiency. Besides public datasets, we also collected a large-scale HVMT dataset named SoloDance for evaluation. Extensive experiments conducted on our SoloDance dataset and the iPER dataset show that our approach outperforms state-of-the-art HVMT methods in terms of both spatial and temporal consistency. Source code and the SoloDance dataset are available at https://github.com/wswdx/C2F-FWN.
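To make the term "flow warping" concrete, the following is a minimal sketch of generic backward warping with a dense flow field and bilinear sampling. It is not the C2F-FWN implementation: the function name `warp_with_flow`, the grayscale list-of-rows image format, the `(dx, dy)` flow convention, and the border clamping are all assumptions chosen for illustration.

```python
import math


def warp_with_flow(image, flow):
    """Backward-warp a grayscale image (list of rows) with a dense flow field.

    Assumed convention: flow[y][x] = (dx, dy), meaning output pixel (x, y)
    is sampled from the source at (x + dx, y + dy) with bilinear
    interpolation. Sample coordinates are clamped to the image border.
    """
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            # Source location, clamped so it stays inside the image.
            sx = min(max(x + dx, 0.0), w - 1.0)
            sy = min(max(y + dy, 0.0), h - 1.0)
            # Corners of the pixel cell that contains the source location.
            x0, y0 = int(math.floor(sx)), int(math.floor(sy))
            x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
            wx, wy = sx - x0, sy - y0
            # Interpolate along x on both rows, then along y.
            top = image[y0][x0] * (1 - wx) + image[y0][x1] * wx
            bottom = image[y1][x0] * (1 - wx) + image[y1][x1] * wx
            out[y][x] = top * (1 - wy) + bottom * wy
    return out
```

Under this convention, a zero flow returns the image unchanged, and a constant flow of `(1, 0)` makes every output pixel take the value of its right-hand neighbour, with the last column repeated because of the clamping.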