CV · AI · Jul 19, 2023

Implicit Identity Representation Conditioned Memory Compensation Network for Talking Head video Generation

arXiv:2307.09906v3 · 76 citations · h-index: 42 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses video generation quality issues for applications such as virtual avatars and entertainment, but it is incremental because it builds on existing talking head generation methods.

The paper tackles the problem of generating high-fidelity talking head videos from a still source image and a driving video, where dramatic motions cause artifacts due to insufficient appearance information. The proposed MCNet uses an implicit identity representation conditioned memory compensation network to learn facial priors, outperforming previous state-of-the-art methods on VoxCeleb1 and CelebV datasets.

Talking head video generation aims to animate a human face in a still image with dynamic poses and expressions using motion information derived from a target-driving video, while maintaining the person's identity in the source image. However, dramatic and complex motions in the driving video cause ambiguous generation, because the still source image cannot provide sufficient appearance information for occluded regions or delicate expression variations, which produces severe artifacts and significantly degrades the generation quality. To tackle this problem, we propose to learn a global facial representation space, and design a novel implicit identity representation conditioned memory compensation network, coined as MCNet, for high-fidelity talking head generation. Specifically, we devise a network module to learn a unified spatial facial meta-memory bank from all training samples, which can provide rich facial structure and appearance priors to compensate warped source facial features for the generation. Furthermore, we propose an effective query mechanism based on implicit identity representations learned from the discrete keypoints of the source image. It can greatly facilitate the retrieval of more correlated information from the memory bank for the compensation. Extensive experiments demonstrate that MCNet can learn representative and complementary facial memory, and can clearly outperform previous state-of-the-art talking head generation methods on VoxCeleb1 and CelebV datasets. Please check our project page: https://github.com/harlanhong/ICCV2023-MCNET
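The memory compensation idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a flat (non-spatial) memory bank, a hypothetical conditioning rule that scales each memory channel by the identity vector, and a residual addition of the retrieved memory to the warped feature. The function names `condition_memory` and `compensate` are made up for illustration. It only shows the general pattern of an identity-conditioned, attention-style lookup into a shared memory.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def condition_memory(memory, identity):
    """Assumed conditioning rule: scale channel c of every slot by (1 + identity[c])."""
    return [[m * (1.0 + g) for m, g in zip(slot, identity)] for slot in memory]


def compensate(warped, memory, identity):
    """Add attention-retrieved memory to each warped feature vector.

    warped:   list of feature vectors (one per spatial position)
    memory:   list of memory slots shared across all samples
    identity: one identity vector, assumed to be derived from source keypoints
    """
    conditioned = condition_memory(memory, identity)
    scale = math.sqrt(len(identity))
    output = []
    for feat in warped:
        # The warped feature is the query; conditioned slots act as keys and values.
        weights = softmax([dot(feat, slot) / scale for slot in conditioned])
        retrieved = [
            sum(w * slot[c] for w, slot in zip(weights, conditioned))
            for c in range(len(feat))
        ]
        # Residual compensation (an assumption, chosen for simplicity).
        output.append([f + r for f, r in zip(feat, retrieved)])
    return output


# Toy usage: two memory slots, one warped feature, a neutral identity vector.
memory = [[1.0, 0.0], [0.0, 1.0]]
identity = [0.0, 0.0]
warped = [[1.0, 0.0]]
result = compensate(warped, memory, identity)
```

In this toy setup the warped feature is closest to the first memory slot, so that slot receives the larger attention weight and contributes most to the compensated feature.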

Code Implementations: 1 repo
