CV Mar 25, 2022

Versatile Multi-Modal Pre-Training for Human-Centric Perception

Fangzhou Hong, Liang Pan, Zhongang Cai, Ziwei Liu

arXiv:2203.13815v1 16.326 citations h-index: 30 Has Code

Originality: Incremental advance

AI Analysis

This work targets data efficiency in human-centric perception for researchers and practitioners in vision and graphics. It is incremental because it builds on existing contrastive learning methods.

The paper tackles the problem of expensive data annotations in human-centric perception by proposing HCMoCo, a versatile multi-modal pre-training framework that leverages RGB, depth, and 2D keypoints. Under data-efficient settings, it reports improvements of 7.16% and 12% on DensePose Estimation and Human Parsing.

Human-centric perception plays a vital role in vision and graphics, but its data annotations are prohibitively expensive. Therefore, it is desirable to have a versatile pre-trained model that serves as a foundation for data-efficient transfer to downstream tasks. To this end, we propose the Human-Centric Multi-Modal Contrastive Learning framework (HCMoCo), which leverages the multi-modal nature of human data (e.g., RGB, depth, 2D keypoints) for effective representation learning. The objective comes with two main challenges: dense pre-training for multi-modal data and efficient use of sparse human priors. To tackle these challenges, we design the novel Dense Intra-sample Contrastive Learning and Sparse Structure-aware Contrastive Learning targets, which hierarchically learn a modal-invariant latent space featuring a continuous and ordinal feature distribution and structure-aware semantic consistency. HCMoCo provides pre-training for different modalities by combining heterogeneous datasets, which allows efficient use of existing task-specific human data. Extensive experiments on four downstream tasks of different modalities demonstrate the effectiveness of HCMoCo, especially under data-efficient settings (7.16% and 12% improvement on DensePose Estimation and Human Parsing). Moreover, we demonstrate the versatility of HCMoCo by exploring cross-modality supervision and missing-modality inference, validating its strong ability in cross-modal association and reasoning.
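To make the idea of cross-modal contrastive learning concrete, the following is a minimal sketch of a generic InfoNCE loss between two sets of embeddings, for example RGB and depth features of the same people. It is not the Dense Intra-sample or Sparse Structure-aware objective described in the abstract, whose details are not given here. The function names, the temperature value, and the toy embeddings are all assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def info_nce(anchors, positives, temperature=0.07):
    """Generic InfoNCE loss between two aligned lists of embeddings.

    anchors[i] and positives[i] are assumed to come from the same sample
    (e.g. RGB and depth features of one person); every other pairing in
    the batch is treated as a negative. Illustrative only: this is NOT
    the HCMoCo objective, and the temperature is an assumed value.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the matching candidate.
        total += log_denom - logits[i]
    return total / len(a)


# Toy two-dimensional embeddings (made-up numbers, hypothetical modalities).
rgb = [[1.0, 0.0], [0.0, 1.0]]
depth_aligned = [[0.9, 0.1], [0.1, 0.9]]   # each depth vector matches its RGB partner
depth_shuffled = [[0.1, 0.9], [0.9, 0.1]]  # partners swapped

loss_aligned = info_nce(rgb, depth_aligned)
loss_shuffled = info_nce(rgb, depth_shuffled)
print(loss_aligned < loss_shuffled)
```

The loss is small when each embedding is closest to its own partner from the other modality and large when partners are mismatched, which is the pressure that pulls the modalities toward a shared latent space.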
