CV GR RO

Aug 14, 2023

A Unified Masked Autoencoder with Patchified Skeletons for Motion Synthesis

Esteve Valls Mascaro, Hyemin Ahn, Dongheui Lee

arXiv:2308.07301v2

7.612 citations

h-index: 36

Originality: Incremental advance

AI Analysis

This work addresses motion synthesis for applications like animation or robotics by offering a task-independent model, though it appears incremental as it builds on existing Vision Transformer ideas and reformulates tasks as reconstruction problems.

The paper tackles human motion synthesis across tasks such as forecasting and inbetweening with a single unified model, UNIMASK-M. It reports performance comparable to or better than the state of the art in each task: the model forecasts human motion on Human3.6M and achieves state-of-the-art inbetweening results on LaFAN1, particularly for long transition periods.

The synthesis of human motion has traditionally been addressed through task-dependent models that focus on specific challenges, such as predicting future motions or filling in intermediate poses conditioned on known key-poses. In this paper, we present a novel task-independent model called UNIMASK-M, which can effectively address these challenges using a unified architecture. Our model obtains comparable or better performance than the state-of-the-art in each field. Inspired by Vision Transformers (ViTs), our UNIMASK-M model decomposes a human pose into body parts to leverage the spatio-temporal relationships existing in human motion. Moreover, we reformulate various pose-conditioned motion synthesis tasks as a reconstruction problem with different masking patterns given as input. By explicitly informing our model about the masked joints, our UNIMASK-M becomes more robust to occlusions. Experimental results show that our model successfully forecasts human motion on the Human3.6M dataset. Moreover, it achieves state-of-the-art results in motion inbetweening on the LaFAN1 dataset, particularly in long transition periods. More information can be found on the project website https://evm7.github.io/UNIMASKM-page/
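The abstract's idea of casting different tasks as one reconstruction problem with different masking patterns can be illustrated with a minimal sketch. This is not the authors' code: the function names, the boolean frame-by-body-part mask layout, and the parameters `num_observed` and `hidden_parts` are all assumptions made for illustration. Here `True` marks a body-part patch that is hidden from the model and must be reconstructed.

```python
from typing import List, Set

# Hypothetical layout: mask[t][p] is True when body part p at frame t is hidden.
Mask = List[List[bool]]


def forecasting_mask(num_frames: int, num_parts: int, num_observed: int) -> Mask:
    """Forecasting: the observed prefix is visible, every later frame is hidden."""
    return [[t >= num_observed] * num_parts for t in range(num_frames)]


def inbetweening_mask(num_frames: int, num_parts: int, num_observed: int) -> Mask:
    """Inbetweening: the observed prefix and the final key-pose are visible,
    and the transition frames between them are hidden."""
    last = num_frames - 1
    return [[num_observed <= t < last] * num_parts for t in range(num_frames)]


def occlusion_mask(num_frames: int, num_parts: int, hidden_parts: Set[int]) -> Mask:
    """Occlusion: the chosen body parts are hidden in every frame."""
    return [[p in hidden_parts for p in range(num_parts)] for _ in range(num_frames)]


if __name__ == "__main__":
    # One row per frame; 1 = hidden patch, 0 = visible patch.
    for name, mask in [
        ("forecasting", forecasting_mask(5, 2, 3)),
        ("inbetweening", inbetweening_mask(5, 2, 2)),
        ("occlusion", occlusion_mask(2, 3, {1})),
    ]:
        print(name)
        for row in mask:
            print("  ", "".join("1" if hidden else "0" for hidden in row))
```

Under this assumed layout, a single reconstruction model would receive the same input shape for every task, and only the mask passed alongside the poses would change.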
