CV, ML · May 10

An Elastic Shape Variational Autoencoder for Skeleton Pose Trajectories

Arafat Rahman, Shashwat Kumar, Laura E. Barnes, Anuj Srivastava

arXiv:2605.09231

AI Analysis

For researchers in human pose analysis and generative modeling, ES-VAE offers a principled way to learn latent representations of skeletal sequences that are invariant to non-shape variations, improving downstream task performance.

The paper proposes the Elastic Shape Variational Autoencoder (ES-VAE), a geometry-aware generative model for skeletal trajectories. It uses the transported square-root velocity field (TSRVF) representation on Kendall's shape manifold to remove nuisance factors such as camera orientation and execution speed. ES-VAE outperforms standard VAEs and sequence modeling baselines (TCNs, transformers, GCNs) on gait analysis (clinical mobility score prediction, stroke classification) and on action recognition (NTU RGB+D).

Deep generative models provide flexible frameworks for modeling complex, structured data such as images, videos, 3D objects, and text. However, when applied to sequences of human skeletons, standard variational autoencoders (VAEs) often allocate substantial capacity to nuisance factors (such as camera orientation, subject scale, viewpoint, and execution speed) rather than to the intrinsic geometry of shapes and their motion. We propose the Elastic Shape Variational Autoencoder (ES-VAE), a geometry-aware generative model for skeletal trajectories that leverages the transported square-root velocity field (TSRVF) representation on Kendall's shape manifold. This representation inherently removes rigid translations, rotations, and global scaling of shapes, as well as temporal rate variability of sequences, isolating the underlying shape dynamics. The ES-VAE encoder maps skeletal sequences to a low-dimensional latent space using the Riemannian logarithm map, while the decoder reconstructs sequences using the corresponding exponential map. We demonstrate the effectiveness of ES-VAE on two datasets. First, we analyze skeletal gait cycles to predict clinical mobility scores and classify subjects into healthy and post-stroke groups. Second, we evaluate action recognition on the NTU RGB+D dataset. Across both settings, ES-VAE consistently outperforms standard VAEs and a range of sequence modeling baselines, including temporal convolutional networks, transformers, and graph convolutional networks. More broadly, ES-VAE provides a principled framework for learning generative models of longitudinal data on pose shape manifolds, offering improved latent representations and downstream performance compared to existing deep learning approaches.
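The shape-space ingredients named in the abstract can be illustrated for a single skeleton frame. The following is a minimal NumPy sketch, not the authors' implementation: the function names (`to_preshape`, `align_rotation`, `log_map`, `exp_map`) are hypothetical, and it covers only the removal of translation, scale, and rotation plus the logarithm and exponential maps on the unit preshape sphere. It leaves out the TSRVF, temporal re-parameterization, and the VAE itself.

```python
import numpy as np

def to_preshape(X):
    """Center a (k, d) joint configuration and scale it to unit Frobenius norm.
    This removes translation and global scale (Kendall preshape)."""
    Xc = X - X.mean(axis=0, keepdims=True)
    return Xc / np.linalg.norm(Xc)

def align_rotation(P, Q):
    """Rotate preshape Q to best match preshape P (rotation-only Procrustes)."""
    U, _, Vt = np.linalg.svd(Q.T @ P)
    if np.linalg.det(U @ Vt) < 0:
        # Flip the direction of least singular value to stay in SO(d).
        U[:, -1] *= -1
    return Q @ (U @ Vt)

def log_map(P, Q):
    """Tangent vector at P pointing toward Q on the unit preshape sphere.
    Assumes Q has already been rotationally aligned to P."""
    inner = np.clip(np.sum(P * Q), -1.0, 1.0)
    theta = np.arccos(inner)
    if theta < 1e-12:
        return np.zeros_like(P)
    return (theta / np.sin(theta)) * (Q - inner * P)

def exp_map(P, V):
    """Point reached from P by following tangent vector V along a geodesic."""
    norm_v = np.linalg.norm(V)
    if norm_v < 1e-12:
        return P.copy()
    return np.cos(norm_v) * P + np.sin(norm_v) * (V / norm_v)
```

In this sketch, a frame that differs from a reference only by rotation, scale, and translation maps to a (numerically) zero tangent vector, and `exp_map(P, log_map(P, Q))` recovers an aligned preshape `Q`. An encoder built along these lines would presumably operate on such tangent vectors rather than on raw joint coordinates.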
