A Light-Weight Contrastive Approach for Aligning Human Pose Sequences
This work is an incremental improvement that gives researchers and practitioners in human motion analysis a simple, efficient tool for sequence alignment.
The paper addresses the alignment of human pose sequences with an unsupervised contrastive learning method that maps 3D pose sequences to embeddings, which are then aligned by dynamic time warping. The resulting approach is fast to train and adaptable, making it suitable for comparing and analyzing human behavior.
We present a simple unsupervised method for learning an encoder mapping short 3D pose sequences into embedding vectors suitable for sequence-to-sequence alignment by dynamic time warping. Training samples consist of temporal windows of frames containing 3D body points such as mocap markers or skeleton joints. A light-weight, 3-layer encoder is trained using a contrastive loss function that encourages embedding vectors of augmented sample pairs to have cosine similarity 1, and similarity 0 with all other samples in a minibatch. When multiple scripted training sequences are available, temporal alignments inferred from an initial round of training are harvested to extract additional, cross-performance match pairs for a second phase of training to refine the encoder. In addition to being simple, the proposed method is fast to train, making it easy to adapt to new data using different marker sets or skeletal joint layouts. Experimental results illustrate ease of use, transferability, and utility of the learned embeddings for comparing and analyzing human behavior sequences.
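The two ingredients described above, a contrastive objective with cosine-similarity targets of 1 for augmented pairs and 0 for all other minibatch pairings, and dynamic time warping over the resulting embeddings, can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the squared-error form of the loss, the `1 - cosine` alignment cost, and the names `cosine`, `contrastive_loss`, and `dtw_align` are all hypothetical choices made here, and the encoder itself is omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def contrastive_loss(anchors, positives):
    """Mean squared error between the batch similarity matrix and the identity.

    anchors[i] and positives[i] are embeddings of two augmentations of the
    same window (target similarity 1); every other pairing in the minibatch
    is treated as a non-match (target similarity 0).
    ASSUMPTION: the abstract states only the targets, not the penalty form.
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        for j in range(n):
            target = 1.0 if i == j else 0.0
            total += (cosine(anchors[i], positives[j]) - target) ** 2
    return total / (n * n)


def dtw_align(seq_a, seq_b):
    """Align two embedding sequences by DTW.

    ASSUMPTION: the local cost is 1 - cosine similarity.
    Returns (total_cost, path), where path is a list of (i, j) index pairs.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 1.0 - cosine(seq_a[i - 1], seq_b[j - 1])
            acc[i][j] = cost + min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])

    # Backtrack from the final cell to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (acc[i - 1][j - 1], i - 1, j - 1),  # diagonal: advance both
            (acc[i - 1][j], i - 1, j),          # advance seq_a only
            (acc[i][j - 1], i, j - 1),          # advance seq_b only
        ]
        _, i, j = min(steps)
    path.reverse()
    return acc[n][m], path
```

In this sketch, the index pairs returned by `dtw_align` for two performances of the same script are the kind of cross-performance matches that a second training phase could reuse as additional positive pairs.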