CV · Nov 30, 2020

Adaptive Compact Attention For Few-shot Video-to-video Translation

arXiv:2011.14695v1
AI Analysis

This work provides an incremental improvement for researchers and practitioners in video synthesis, specifically for few-shot video-to-video translation.

This paper addresses few-shot video-to-video translation by proposing an adaptive compact attention model that efficiently extracts contextual features from multiple reference images. The method achieves superior performance in producing photorealistic and temporally consistent videos, showing considerable improvements over state-of-the-art methods on talking-head and human dancing datasets.

This paper proposes an adaptive compact attention model for few-shot video-to-video translation. Existing works in this domain use only features from pixel-wise attention and do not consider the correlations among multiple reference images, which leads to heavy computation but limited performance. We therefore introduce a novel adaptive compact attention mechanism that efficiently extracts contextual features jointly from multiple reference images; the view-dependent and motion-dependent information these features encode can significantly benefit the synthesis of realistic videos. Our core idea is to extract compact basis sets from all the reference images as higher-level representations. To further improve reliability, in the inference phase we also propose a novel method based on the Delaunay Triangulation algorithm that automatically selects the most informative references according to the input label. We extensively evaluate our method on a large-scale talking-head video dataset and a human dancing dataset; the experimental results show that our method produces photorealistic and temporally consistent videos and improves considerably on the state-of-the-art method.
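To make the core idea concrete, the following is a minimal NumPy sketch of attending to a compact basis set rather than to every reference pixel. It is an illustration under stated assumptions, not the paper's implementation: the soft k-means style basis update, the function names `extract_bases` and `compact_attention`, the number of bases, and the toy feature sizes are all hypothetical choices made for this example.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def extract_bases(ref_feats, num_bases=8, num_iters=3, seed=0):
    """Summarize all reference features with a few basis vectors.

    ref_feats: (N, C) array of pixel features pooled from every reference image.
    Returns a (num_bases, C) array. The soft k-means style update below is an
    assumption made for this sketch, not the paper's confirmed procedure.
    """
    rng = np.random.default_rng(seed)
    # Initialize the bases from randomly chosen reference features.
    idx = rng.choice(ref_feats.shape[0], size=num_bases, replace=False)
    bases = ref_feats[idx].copy()
    for _ in range(num_iters):
        # Soft-assign each reference feature to the bases: shape (N, K).
        assign = softmax(ref_feats @ bases.T, axis=1)
        # Re-estimate each basis as the assignment-weighted mean feature.
        weights = assign / (assign.sum(axis=0, keepdims=True) + 1e-8)
        bases = weights.T @ ref_feats
    return bases


def compact_attention(query_feats, bases):
    """Attend from query pixels to the bases instead of to every reference pixel.

    query_feats: (M, C); bases: (K, C).
    Returns the (M, C) context features and the (M, K) attention weights.
    """
    scale = np.sqrt(query_feats.shape[1])
    attn = softmax(query_feats @ bases.T / scale, axis=1)
    return attn @ bases, attn


# Toy example: 3 reference images of 16x16 pixels with 32 channels each.
rng = np.random.default_rng(1)
refs = rng.standard_normal((3 * 16 * 16, 32))
query = rng.standard_normal((16 * 16, 32))
bases = extract_bases(refs, num_bases=8)
context, attn = compact_attention(query, bases)
print(context.shape, attn.shape)
```

In this toy setting the attention map has one column per basis (8) rather than one per reference pixel (768), which is the kind of saving over pixel-wise attention that the abstract alludes to.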
