Towards Consistent Long-Term Pose Generation
This work addresses the challenge of maintaining temporal coherence in pose generation for applications such as animation and robotics, though it appears incremental because it builds on existing generation paradigms.
The paper tackles the problem of degraded performance in long-term pose generation caused by reliance on intermediate representations, proposing a one-stage architecture that directly generates poses from minimal context. The reported results show that it significantly outperforms existing methods, particularly in long-term scenarios, on the Penn Action and F-PHAB datasets.
Current approaches to pose generation rely heavily on intermediate representations, either through two-stage pipelines with quantization or through autoregressive models that accumulate errors during inference. This fundamental limitation leads to degraded performance, particularly in long-term pose generation, where maintaining temporal coherence is crucial. We propose a novel one-stage architecture that directly generates poses in continuous coordinate space from minimal context (a single RGB image and a text description) while maintaining consistent distributions between training and inference. Our key innovation is eliminating the need for intermediate representations or token-based generation by operating directly on pose coordinates. Two components make this possible: a relative movement prediction mechanism that preserves spatial relationships, and a unified placeholder token approach that enables single-forward generation with identical behavior during training and inference. Through extensive experiments on the Penn Action and First-Person Hand Action Benchmark (F-PHAB) datasets, we demonstrate that our approach significantly outperforms existing quantization-based and autoregressive methods, especially in long-term generation scenarios.
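The two components named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the placeholder token name, the function names, and the choice to treat each predicted movement as a displacement from the previous frame are all assumptions made for illustration, and the network that would produce the displacements in a single forward pass is left out.

```python
from typing import List, Tuple

Pose = List[Tuple[float, float]]  # one (x, y) coordinate pair per joint

PLACEHOLDER = "<pose>"  # hypothetical name for the shared placeholder token


def build_query_sequence(context_tokens: List[str], num_frames: int) -> List[str]:
    """Append one identical placeholder per frame to be generated.

    The same layout would be used in training and inference, so the model
    never conditions on its own earlier outputs (unlike autoregression).
    """
    return list(context_tokens) + [PLACEHOLDER] * num_frames


def decode_relative_movements(initial_pose: Pose, deltas: List[Pose]) -> List[Pose]:
    """Turn per-frame joint displacements into absolute poses.

    Assumption: each entry of `deltas` is the displacement from the
    previous frame, so absolute poses are a running sum from the
    initial pose.
    """
    poses: List[Pose] = []
    current = list(initial_pose)
    for delta in deltas:
        if len(delta) != len(current):
            raise ValueError("each delta must have one entry per joint")
        current = [(x + dx, y + dy) for (x, y), (dx, dy) in zip(current, delta)]
        poses.append(current)
    return poses


if __name__ == "__main__":
    # Context stands in for the single RGB image and the text description.
    print(build_query_sequence(["<img>", "<txt>"], 3))
    # Two joints, two frames of hypothetical predicted displacements.
    start = [(0.0, 0.0), (1.0, 1.0)]
    moves = [[(1.0, 0.0), (1.0, 0.0)], [(0.0, 2.0), (0.0, 2.0)]]
    print(decode_relative_movements(start, moves))
```

In this sketch every joint in a frame receives its displacement in continuous coordinates, so no quantization step or token vocabulary for poses is involved, which is the contrast with two-stage pipelines that the abstract draws.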