Shortcut-V2V: Compression Framework for Video-to-Video Translation based on Temporal Redundancy Reduction
The work addresses the computational inefficiency of video-to-video translation, which could enable wider deployment, though the contribution is incremental because it builds on existing models.
The paper tackles the high computational cost of video-to-video translation models by proposing Shortcut-V2V, a compression framework that reduces temporal redundancy, achieving comparable performance while saving 3.2-5.7x computational cost and 7.8-44x memory.
Video-to-video translation aims to generate video frames of a target domain from an input video. Despite its usefulness, existing networks require enormous computation, so model compression is necessary for wide use. While compression methods exist that improve computational efficiency in various image and video tasks, a generally applicable compression method for video-to-video translation has not been studied much. In response, we present Shortcut-V2V, a general-purpose compression framework for video-to-video translation. Shortcut-V2V avoids full inference for every neighboring video frame by approximating the intermediate features of the current frame from those of the previous frame. Moreover, our framework includes a newly proposed block called AdaBD, which adaptively blends and deforms features of neighboring frames, enabling more accurate predictions of the intermediate features. We conduct quantitative and qualitative evaluations using well-known video-to-video translation models on various tasks to demonstrate the general applicability of our framework. The results show that Shortcut-V2V achieves performance comparable to the original video-to-video translation model while saving 3.2-5.7x computational cost and 7.8-44x memory at test time.
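The idea of skipping full inference on neighboring frames can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`full_encode`, `cheap_encode`, `decode`, `translate_video`), the fixed keyframe `interval`, and the constant blend weight `alpha` are all assumptions made for illustration. In particular, the fixed-weight blend only stands in for AdaBD, which the abstract describes as adaptive and as also deforming features; no deformation is modeled here.

```python
import numpy as np

def full_encode(frame):
    # Hypothetical stand-in for the expensive layers of a translation model.
    return frame * 2.0

def cheap_encode(frame):
    # Hypothetical stand-in for a lightweight estimate from the current frame:
    # a coarse (mean-pooled) version of the full features.
    return np.full_like(frame, frame.mean()) * 2.0

def decode(feat):
    # Hypothetical stand-in for the remaining layers that produce the output frame.
    return feat + 1.0

def translate_video(frames, interval=3, alpha=0.5):
    """Run full inference only every `interval` frames (assumed schedule).

    For the other frames, approximate the intermediate features by blending
    the previous frame's features with a cheap estimate from the current frame.
    """
    outputs = []
    full_calls = 0
    prev_feat = None
    for t, frame in enumerate(frames):
        if t % interval == 0:
            feat = full_encode(frame)  # full inference on a keyframe
            full_calls += 1
        else:
            # Fixed-weight blend; a simplification, not the AdaBD block.
            feat = alpha * prev_feat + (1.0 - alpha) * cheap_encode(frame)
        outputs.append(decode(feat))
        prev_feat = feat
    return outputs, full_calls

frames = [np.ones((4, 4)) for _ in range(6)]
outputs, full_calls = translate_video(frames, interval=3)
print(len(outputs), full_calls)
```

With six frames and an interval of three, only two frames go through the expensive path in this toy setup; the remaining four reuse the previous frame's features. The actual savings reported in the abstract depend on the real models and on the cost of the shortcut path, neither of which this sketch models.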