CV · May 16

HighSync: High-Quality Lip Synchronization via Latent Diffusion Models

Saeed Firouzi Daghigh, Majid Iranpour Mobarekeh, Mostafa Alavi, Mehdi Bagheri

arXiv:2605.16918 · 9.1 · Has Code

Predicted impact: top 76% in CV · last 90 days
Originality: Incremental advance

AI Analysis

For professionals in the film and broadcast industries, HighSync offers a viable solution for high-fidelity lip synchronization without compromising image quality or temporal consistency.

HighSync is the first lip sync model to operate natively at 512×512 resolution, achieving state-of-the-art performance in both perceptual quality and synchronization accuracy by eliminating a data leakage phenomenon that undermined temporal modeling in prior work.

We present HighSync, an end-to-end diffusion-based framework for high-fidelity lip synchronization that generates photorealistic talking-face videos aligned with arbitrary input audio. Existing approaches consistently struggle to reconcile image quality with synchronization accuracy, producing either visually degraded outputs or temporally inconsistent lip movements. HighSync addresses both challenges simultaneously and, to our knowledge, is the first lip sync model to operate natively at 512×512 resolution, positioning it as a viable solution for professional production environments such as the film and broadcast industries. Central to our approach is the identification and systematic elimination of a data leakage phenomenon that has silently undermined temporal modeling in prior work, preventing models from developing a genuine dependence on the audio signal. Comprehensive evaluations across both perceptual quality and synchronization accuracy metrics confirm that HighSync achieves state-of-the-art performance on both fronts. Source code, pre-trained models, and supplementary video results are publicly available at: https://github.com/saeed5959/high_sync
