CV · Oct 27, 2022

Learning Joint Representation of Human Motion and Language

Jihoon Kim, Youngjae Yu, Seungyoun Shin, Taehyun Byun, Sungjoon Choi

arXiv:2210.15187v1 · 4.85 citations · h-index: 20

Originality: Incremental advance

AI Analysis

This work addresses the challenge of connecting human motion and language for tasks such as action recognition and motion retrieval. The advance is incremental because it builds on existing multimodal representation learning approaches.

The paper tackles the problem of learning joint representations of human motion and language. It introduces MoLang, a model trained with contrastive learning on both unpaired and paired datasets, and reports that it outperforms state-of-the-art methods on a number of action recognition benchmarks.

In this work, we present MoLang (a Motion-Language connecting model) for learning a joint representation of human motion and language, leveraging both unpaired and paired datasets of motion and language modalities. To this end, we propose a motion-language model with contrastive learning, empowering our model to learn more generalizable representations of the human motion domain. Empirical results show that our model learns strong representations of human motion data by navigating the language modality. Our proposed method can perform both action recognition and motion retrieval with a single model, and it outperforms state-of-the-art approaches on a number of action recognition benchmarks.
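The abstract does not say which contrastive objective is used. As an illustration of the general idea only, here is a minimal sketch that assumes a CLIP-style symmetric InfoNCE loss over paired motion and text embeddings; it does not cover the unpaired-data part of training. All function names, the temperature value of 0.07, and the toy vectors are hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_logits(motion_embs, text_embs, temperature=0.07):
    """Temperature-scaled cosine similarity between every motion and every text."""
    m = [l2_normalize(v) for v in motion_embs]
    t = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(mi, tj)) / temperature for tj in t] for mi in m]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        mx = max(row)  # subtract the max for numerical stability
        log_sum_exp = mx + math.log(sum(math.exp(x - mx) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(motion_embs, text_embs, temperature=0.07):
    """Average of the motion-to-text and text-to-motion losses (assumed form)."""
    logits = similarity_logits(motion_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))


def retrieve_motion(text_emb, motion_embs):
    """Return the index of the motion embedding closest to a text query."""
    q = l2_normalize(text_emb)
    scores = [sum(a * b for a, b in zip(q, l2_normalize(m))) for m in motion_embs]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy 2-D embeddings: pair i of motion and text points in the same direction.
motions = [[1.0, 0.0], [0.0, 1.0]]
texts_aligned = [[1.0, 0.0], [0.0, 1.0]]
texts_shuffled = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = symmetric_contrastive_loss(motions, texts_aligned)
loss_shuffled = symmetric_contrastive_loss(motions, texts_shuffled)
best_match = retrieve_motion([0.9, 0.1], motions)
```

In this sketch, correctly paired embeddings give a loss near zero while mismatched pairs give a large loss, which is the signal that pulls matching motion and text together in a shared space. Once the space is shared, retrieval reduces to a nearest-neighbor lookup, and action recognition can be framed the same way by comparing a motion embedding against text embeddings of the class names.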
