CV · Feb 13, 2022

Data standardization for robust lip sync

arXiv:2202.06198v3
Originality: Incremental advance
AI Analysis

This work addresses the robustness of lip sync in audio-visual applications. It is incremental: it adds a preprocessing step on top of existing methods.

The paper tackles the problem of poor robustness in lip sync methods by proposing a data standardization pipeline that disentangles lip motion from distracting visual factors, enabling existing methods to improve data efficiency and achieve competitive performance in active speaker detection.

Lip sync is a fundamental audio-visual task. However, existing lip sync methods fall short of being robust in the wild. One important cause could be distracting factors on the visual input side, making extracting lip motion information difficult. To address these issues, this paper proposes a data standardization pipeline to standardize the visual input for lip sync. Based on recent advances in 3D face reconstruction, we first create a model that can consistently disentangle lip motion information from the raw images. Then, standardized images are synthesized with disentangled lip motion information, with all other attributes related to distracting factors set to predefined values independent of the input, to reduce their effects. Using synthesized images, existing lip sync methods improve their data efficiency and robustness, and they achieve competitive performance for the active speaker detection task.
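The standardization idea in the abstract can be illustrated with a minimal sketch. It assumes a 3D face reconstruction model returns a per-frame parameter set split into identity, expression, pose, lighting, and texture, and that lip motion lives in the expression coefficients. The `FaceParams` type, the `CANONICAL` values, and the `standardize` function are all hypothetical names chosen for illustration, not the paper's implementation. The image synthesis step is omitted.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class FaceParams:
    """Hypothetical per-frame output of a 3D face reconstruction model."""
    identity: Tuple[float, ...]
    expression: Tuple[float, ...]      # assumed to carry the lip motion
    pose: Tuple[float, float, float]   # yaw, pitch, roll
    lighting: Tuple[float, ...]
    texture: Tuple[float, ...]


# Predefined values that do not depend on the input (illustrative numbers only).
CANONICAL = FaceParams(
    identity=(0.0, 0.0, 0.0, 0.0),
    expression=(0.0, 0.0, 0.0),
    pose=(0.0, 0.0, 0.0),
    lighting=(1.0, 0.0, 0.0),
    texture=(0.5, 0.5, 0.5, 0.5),
)


def standardize(params: FaceParams) -> FaceParams:
    """Keep the expression coefficients and reset every other attribute."""
    return replace(CANONICAL, expression=params.expression)


# Two frames with the same mouth shape but different speaker, pose, and lighting.
frame_a = FaceParams((0.3, -0.1, 0.2, 0.0), (0.7, 0.1, -0.4),
                     (25.0, -5.0, 2.0), (0.4, 0.2, 0.1), (0.9, 0.1, 0.3, 0.2))
frame_b = FaceParams((-0.5, 0.4, 0.0, 0.1), (0.7, 0.1, -0.4),
                     (-10.0, 8.0, 0.0), (1.2, 0.0, 0.3), (0.2, 0.6, 0.6, 0.4))

same_after_standardization = standardize(frame_a) == standardize(frame_b)
```

In this sketch the two frames become identical once standardized, so a downstream lip sync model would see only the difference in mouth motion between frames. A renderer or generator would then turn the standardized parameters back into images.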

