Hear Your Face: Face-based voice conversion with F0 estimation
This work addresses voice conversion for applications such as personalized speech synthesis, but its contribution is incremental: it builds on existing face-based methods and adds a focus on F0 estimation.
The paper tackles face-based voice conversion by estimating the target speaker's average fundamental frequency (F0) from facial images. The authors report superior speech generation quality and closer alignment between facial features and voice characteristics.
This paper delves into the emerging field of face-based voice conversion, leveraging the unique relationship between an individual's facial features and their vocal characteristics. We present a novel face-based voice conversion framework that particularly utilizes the average fundamental frequency of the target speaker, derived solely from their facial images. Through extensive analysis, our framework demonstrates superior speech generation quality and the ability to align facial features with voice characteristics, including tracking of the target speaker's fundamental frequency.
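To make the role of the average F0 concrete, the following is a minimal sketch of one common way an average target F0 can steer a source pitch contour: a mean shift in the log-F0 domain. It is not the paper's method. The function names, the convention that unvoiced frames are marked with 0, and the use of a log-domain (geometric) mean are all assumptions made for illustration, and `target_mean_f0_hz` stands in for whatever value a face-based estimator would output.

```python
import math


def mean_log_f0(f0_hz):
    """Mean of log-F0 over voiced frames (assumption: unvoiced frames are 0)."""
    voiced = [math.log(f) for f in f0_hz if f > 0]
    if not voiced:
        raise ValueError("contour has no voiced frames")
    return sum(voiced) / len(voiced)


def shift_to_target_mean(source_f0_hz, target_mean_f0_hz):
    """Scale a source contour so its geometric-mean F0 matches the target.

    target_mean_f0_hz is a hypothetical stand-in for a face-derived estimate.
    Unvoiced frames (0) are left untouched.
    """
    ratio = target_mean_f0_hz / math.exp(mean_log_f0(source_f0_hz))
    return [f * ratio if f > 0 else 0.0 for f in source_f0_hz]


# Example: move a low-pitched contour toward an assumed 220 Hz target average.
source = [0.0, 100.0, 200.0, 0.0]
shifted = shift_to_target_mean(source, 220.0)
```

Scaling by a single ratio preserves the shape of the source intonation while moving its overall pitch level, which is why only an average F0 for the target speaker is needed in this kind of scheme.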