CV · SD · AS · Feb 27, 2023

Cross-modal Face- and Voice-style Transfer

arXiv:2302.13838v2 · 2 citations · h-index: 27
AI Analysis

This paper addresses the challenge of cross-modal impression matching for content creation. It offers a novel framework, but the contribution is incremental in that it combines existing tasks.

The paper tackles the problem of generating faces and voices whose styles match each other across modalities. It proposes XFaVoT, a framework that jointly learns image translation and voice conversion tasks and that outperforms baselines in quality, diversity, and face-voice correspondence on multiple datasets.

Image-to-image translation and voice conversion enable the generation of a new facial image and voice while maintaining some of the semantics, such as the pose in an image and the linguistic content in audio, respectively. They can aid the content-creation process in many applications. However, as they are limited to conversion within each modality, matching the impression of the generated face and voice remains an open question. We propose a cross-modal style transfer framework called XFaVoT that jointly learns four tasks: image translation and voice conversion, each with audio or image guidance. This enables the generation of a "face that matches given voice" and a "voice that matches given face", as well as intra-modality translation, within a single framework. Experimental results on multiple datasets show that XFaVoT achieves cross-modal style translation of image and voice, outperforming baselines in terms of quality, diversity, and face-voice correspondence.
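The four jointly learned tasks can be read as every pairing of a content modality with a guidance modality. The following is a minimal sketch of that enumeration only; it is not the authors' implementation, and the names `MODALITIES`, `enumerate_tasks`, and the dictionary keys are hypothetical labels chosen for illustration.

```python
from itertools import product

# Hypothetical labels for the two modalities named in the abstract.
MODALITIES = ("face", "voice")


def enumerate_tasks():
    """List each (content, guidance) pairing as one translation task.

    Illustrative assumption: the content modality decides the task type
    (image translation for faces, voice conversion for voices), and the
    guidance modality supplies the target style.
    """
    tasks = []
    for content, guidance in product(MODALITIES, repeat=2):
        task = "image translation" if content == "face" else "voice conversion"
        tasks.append(
            {
                "task": task,
                "content": content,
                "guidance": guidance,
                # Cross-modal when the style comes from the other modality.
                "cross_modal": content != guidance,
            }
        )
    return tasks


if __name__ == "__main__":
    for t in enumerate_tasks():
        kind = "cross-modal" if t["cross_modal"] else "intra-modality"
        print(f'{t["task"]} with {t["guidance"]} guidance ({kind})')
```

Under this reading, two of the four pairings are the intra-modality tasks and the other two are the cross-modal ones: a face styled by a given voice, and a voice styled by a given face.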

