Multi-Source Spatial Knowledge Understanding for Immersive Visual Text-to-Speech
This work addresses the challenge of enhancing immersive speech experiences in VTTS for applications like virtual reality, though it appears incremental by building on prior RGB-based approaches.
The paper tackles immersive speech generation in Visual Text-to-Speech (VTTS) by integrating multi-source spatial knowledge (depth, speaker position, and environmental semantics), yielding a method that surpasses existing baselines.
Visual Text-to-Speech (VTTS) takes an environmental image as the prompt and synthesizes reverberant speech for the spoken content. Previous works focus on the RGB modality for global environmental modeling, overlooking the potential of multi-source spatial knowledge such as depth, speaker position, and environmental semantics. To address this limitation, we propose a novel multi-source spatial knowledge understanding scheme for immersive VTTS, termed MS$^2$KU-VTTS. Specifically, we treat the RGB image as the dominant source and use the depth image, speaker position knowledge from object detection, and Gemini-generated semantic captions as supplementary sources. We then propose a serial interaction mechanism to integrate the dominant and supplementary sources effectively. The resulting multi-source knowledge is dynamically integrated according to the contribution of each source. This enriched interaction and integration of multi-source spatial knowledge guides the speech generation model, enhancing the immersive speech experience. Experimental results demonstrate that MS$^2$KU-VTTS surpasses existing baselines in generating immersive speech. Demos and code are available at: https://github.com/AI-S2-Lab/MS2KU-VTTS.
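
The two-stage idea described above (serial interaction of a dominant source with supplementary sources, followed by contribution-weighted integration) can be illustrated with a toy sketch. It is not the authors' implementation: the abstract does not specify the operators, so the sigmoid gate, the dot-product scoring, the softmax weighting, the function names `serial_interaction` and `dynamic_integration`, and the three-dimensional feature vectors are all assumptions chosen for illustration.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def serial_interaction(dominant: Vector,
                       supplementary: Dict[str, Vector]) -> Dict[str, Vector]:
    """Refine the dominant feature with each supplementary source in turn.

    Assumption: each step adds the supplementary feature scaled by a
    sigmoid gate on its similarity to the running state. The real model
    may use a different operator (e.g. attention).
    """
    state = list(dominant)
    interacted: Dict[str, Vector] = {}
    for name, feature in supplementary.items():
        gate = 1.0 / (1.0 + math.exp(-dot(state, feature)))
        state = [s + gate * f for s, f in zip(state, feature)]
        interacted[name] = list(state)
    return interacted


def dynamic_integration(interacted: Dict[str, Vector],
                        query: Vector) -> Tuple[Vector, Dict[str, float]]:
    """Fuse the interacted features, weighting each by its contribution.

    Assumption: contribution is scored by similarity to a query vector
    and normalised with softmax.
    """
    names = list(interacted)
    weights = softmax([dot(query, interacted[n]) for n in names])
    fused = [
        sum(w * interacted[n][i] for w, n in zip(weights, names))
        for i in range(len(query))
    ]
    return fused, dict(zip(names, weights))


# Hypothetical toy features; real features would come from image,
# depth, detection, and caption encoders.
rgb = [0.5, 0.1, -0.2]
sources = {
    "depth": [0.3, -0.1, 0.4],
    "position": [0.0, 0.2, 0.1],
    "caption": [-0.2, 0.5, 0.3],
}
interacted = serial_interaction(rgb, sources)
fused, weights = dynamic_integration(interacted, rgb)
print(weights)
```

The weights form a probability distribution over the supplementary sources, so a source whose interacted feature aligns more closely with the query contributes more to the fused representation that would condition the speech generator.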