T3M: Text Guided 3D Human Motion Synthesis from Speech
T3M addresses the need for more accurate and customizable motion synthesis in applications such as virtual reality, gaming, and film production, and improves on existing speech-only approaches.
The paper tackles the problem of inaccurate and inflexible speech-driven 3D human motion synthesis by introducing T3M, a method that uses textual input for precise control and is reported to greatly outperform state-of-the-art methods in both quantitative metrics and qualitative evaluations.
Speech-driven 3D motion synthesis seeks to create lifelike animations based on human speech, with potential uses in virtual reality, gaming, and film production. Existing approaches rely solely on speech audio for motion generation, leading to inaccurate and inflexible synthesis results. To mitigate this problem, we introduce a novel text-guided 3D human motion synthesis method, termed \textit{T3M}. Unlike traditional approaches, T3M allows precise control over motion synthesis via textual input, enhancing the degree of diversity and user customization. The experimental results demonstrate that T3M can greatly outperform the state-of-the-art methods in both quantitative metrics and qualitative evaluations. We have publicly released our code at \href{https://github.com/Gloria2tt/T3M.git}{https://github.com/Gloria2tt/T3M.git}.