CoT-Pose: Chain-of-Thought Reasoning for 3D Pose Generation from Abstract Prompts
CoT-Pose addresses a practical challenge in deploying pose generation systems in real-world scenarios: bridging the gap between abstract language and the detailed pose descriptions that existing models require.
The paper tackles the problem of generating 3D human poses from abstract prompts, which are closer to how humans naturally communicate. It introduces a chain-of-thought reasoning framework that interprets such prompts and produces plausible, semantically aligned 3D poses.
Recent advances in multi-modal large language models (MLLMs) and chain-of-thought (CoT) reasoning have led to significant progress in image and text generation tasks. However, 3D human pose generation still faces critical limitations. Most existing text-to-pose models rely heavily on detailed (low-level) prompts that explicitly describe joint configurations. In contrast, humans tend to communicate actions and intentions using abstract (high-level) language. This mismatch creates a practical challenge for deploying pose generation systems in real-world scenarios. To bridge this gap, we introduce a novel framework that incorporates CoT reasoning into the pose generation process, enabling the interpretation of abstract prompts into accurate 3D human poses. We further propose a data synthesis pipeline that automatically generates triplets of abstract prompts, detailed prompts, and corresponding 3D poses for training. Experimental results demonstrate that our reasoning-enhanced model, CoT-Pose, can effectively generate plausible and semantically aligned poses from abstract textual inputs. This work highlights the importance of high-level understanding in pose generation and opens new directions for reasoning-enhanced approaches to human pose generation.
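To make the triplet structure concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes that the reasoning step yields a detailed prompt which is then decoded into a pose; the abstract does not spell out this decomposition. The names `PoseTriplet`, `generate_pose`, `toy_reason`, and `toy_decode` are hypothetical, as is the per-joint 3-tuple pose format.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumed per-joint 3D value; the actual pose representation is not given here.
Joint = Tuple[float, float, float]


@dataclass(frozen=True)
class PoseTriplet:
    """One sample: abstract prompt, detailed prompt, and corresponding 3D pose."""
    abstract_prompt: str
    detailed_prompt: str
    pose: List[Joint]


def generate_pose(
    abstract_prompt: str,
    reason: Callable[[str], str],
    decode: Callable[[str], List[Joint]],
) -> PoseTriplet:
    """Assumed two-step flow: reason to a detailed description, then decode a pose."""
    detailed = reason(abstract_prompt)  # reasoning step: abstract -> detailed
    pose = decode(detailed)             # pose step: detailed -> joints
    return PoseTriplet(abstract_prompt, detailed, pose)


# Toy stand-ins (hypothetical) so the sketch runs without any model.
def toy_reason(prompt: str) -> str:
    return f"Detailed description of: {prompt}"


def toy_decode(detailed: str, num_joints: int = 3) -> List[Joint]:
    # Returns a placeholder rest pose; a real decoder would depend on the text.
    return [(0.0, 0.0, 0.0) for _ in range(num_joints)]


sample = generate_pose("celebrating a goal", toy_reason, toy_decode)
```

In a real system the two callables would be replaced by a reasoning model and a pose decoder; keeping them as parameters only illustrates how the three elements of a triplet relate to one another.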