SD AI HC AS | Nov 20, 2024

Comparative Analysis of Audio Feature Extraction for Real-Time Talking Portrait Synthesis

Pegah Salehi, Sajad Amouei Sheshkal, Vajira Thambawita, Sushant Gautam, Saeed S. Sabet, Dag Johansen, Michael A. Riegler, Pål Halvorsen

arXiv:2411.13209v14.92 citations | h-index: 30 | Has Code

Originality: Incremental advance

AI Analysis

The paper addresses latency in real-time talking-head synthesis for interviewer training, but the contribution is incremental because it adapts an existing model to a specific application.

This paper tackles latency in real-time talking-head generation for interviewer training by replacing conventional audio feature extraction models with OpenAI's Whisper, which accelerates processing and improves rendering quality for more realistic interactions.

This paper examines the integration of real-time talking-head generation for interviewer training, focusing on overcoming challenges in Audio Feature Extraction (AFE), which often introduces latency and limits responsiveness in real-time applications. To address these issues, we propose and implement a fully integrated system that replaces conventional AFE models with OpenAI's Whisper, leveraging its encoder to optimize processing and improve overall system efficiency. Our evaluation of two open-source real-time models across three different datasets shows that Whisper not only accelerates processing but also improves specific aspects of rendering quality, resulting in more realistic and responsive talking-head interactions. These advancements make the system a more effective tool for immersive, interactive training applications, expanding the potential of AI-driven avatars in interviewer training.
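
To make the idea of feeding encoder features to a frame-by-frame renderer concrete, the sketch below shows one way a window of audio features could be aligned to each video frame. It is not taken from the paper. The only fixed fact is that Whisper's encoder emits one feature vector per 20 ms of audio (50 per second). The function name `audio_window_for_video_frame`, the 25 fps video rate, and the 16-frame context window are illustrative assumptions.

```python
from typing import Tuple

# Whisper's encoder produces one feature vector per 20 ms of audio.
WHISPER_ENCODER_FPS = 50


def audio_window_for_video_frame(
    frame_idx: int,
    num_audio_frames: int,
    video_fps: int = 25,   # assumption: video frame rate of the renderer
    half_window: int = 8,  # assumption: context on each side, in encoder frames
) -> Tuple[int, int, int, int]:
    """Return (start, end, pad_left, pad_right) for one video frame.

    features[start:end] is the usable slice of encoder output; pad_left and
    pad_right say how many frames of padding are needed so that every
    window has the same length of 2 * half_window.
    """
    # Encoder frame that coincides with the start of this video frame.
    center = frame_idx * WHISPER_ENCODER_FPS // video_fps
    if frame_idx < 0 or center >= num_audio_frames:
        raise ValueError("video frame lies outside the available audio")

    lo, hi = center - half_window, center + half_window
    start, end = max(lo, 0), min(hi, num_audio_frames)
    return start, end, start - lo, hi - end
```

With 1500 encoder frames (30 s of audio), video frame 10 maps to the slice `[12, 28)` with no padding, while the first and last video frames need padding on one side. In a streaming setting the right-hand half of the window is what forces the system to wait for future audio, so shrinking `half_window` trades context for latency.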

