PersonMAE: Person Re-Identification Pre-Training with Masked AutoEncoders
This work addresses the problem of learning robust feature representations for person re-identification, which is crucial for surveillance and security applications. The contribution is incremental: it builds on masked autoencoders and adds task-specific adaptations.
The paper tackles person re-identification by proposing PersonMAE, a pre-training framework using masked autoencoders to learn features with multi-level awareness, occlusion robustness, and cross-region invariance, achieving state-of-the-art performance with mAP scores of 79.8% on MSMT17 and 69.5% on OccDuke, surpassing previous methods by +8.0 and +5.3 mAP respectively.
Pre-training is playing an increasingly important role in learning generic feature representations for Person Re-identification (ReID). We argue that a high-quality ReID representation should have three properties, namely, multi-level awareness, occlusion robustness, and cross-region invariance. To this end, we propose a simple yet effective pre-training framework, namely PersonMAE, which introduces two core designs into masked autoencoders to better serve the task of Person ReID. 1) PersonMAE generates two regions from the given image, with RegionA as the input and RegionB as the prediction target. RegionA is corrupted with block-wise masking to mimic common occlusion in ReID, and its remaining visible parts are fed into the encoder. 2) PersonMAE then aims to predict the whole RegionB at both the pixel level and the semantic feature level. This design encourages the pre-trained feature representations to have the three properties mentioned above. These properties make PersonMAE compatible with downstream Person ReID tasks, leading to state-of-the-art performance on four downstream ReID tasks, i.e., supervised (holistic and occluded settings) and unsupervised (UDA and USL settings). Notably, in the commonly adopted supervised setting, PersonMAE with a ViT-B backbone achieves 79.8% and 69.5% mAP on the MSMT17 and OccDuke datasets, surpassing the previous state-of-the-art by large margins of +8.0 mAP and +5.3 mAP, respectively.
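The block-wise masking of RegionA described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `blockwise_mask` and `visible_patches`, the 16x8 patch grid, the 50% mask ratio, and the block size range are all assumptions chosen for illustration, since the abstract does not specify them. The sketch only shows the general idea of masking contiguous rectangular blocks of patches (rather than independent random patches) and keeping the visible patches as encoder input.

```python
import numpy as np


def blockwise_mask(grid_h, grid_w, mask_ratio=0.5, min_block=2, max_block=6, rng=None):
    """Return a boolean (grid_h, grid_w) array where True marks a masked patch.

    Rectangular blocks are placed at random positions until at least
    mask_ratio of the grid is covered. All parameters are illustrative
    assumptions, not values taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    target = int(round(mask_ratio * grid_h * grid_w))
    mask = np.zeros((grid_h, grid_w), dtype=bool)
    while mask.sum() < target:
        # Sample a block size, clipped so the block fits inside the grid.
        bh = min(int(rng.integers(min_block, max_block + 1)), grid_h)
        bw = min(int(rng.integers(min_block, max_block + 1)), grid_w)
        # Sample the top-left corner of the block.
        top = int(rng.integers(0, grid_h - bh + 1))
        left = int(rng.integers(0, grid_w - bw + 1))
        mask[top:top + bh, left:left + bw] = True
    return mask


def visible_patches(patches, mask):
    """Keep only the unmasked patches.

    patches: array of shape (grid_h * grid_w, dim), in row-major grid order.
    mask:    boolean array of shape (grid_h, grid_w), True = masked.
    """
    return patches[~mask.ravel()]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical 16x8 patch grid (e.g. a tall pedestrian crop split into patches).
    mask = blockwise_mask(16, 8, mask_ratio=0.5, rng=rng)
    patches = rng.normal(size=(16 * 8, 768))  # stand-in patch embeddings
    encoder_input = visible_patches(patches, mask)
    print(mask.astype(int))
    print(encoder_input.shape)
```

Because blocks may overlap and the last block can overshoot, the masked fraction here is at least `mask_ratio` rather than exactly equal to it; an implementation that needs a fixed number of visible patches per image for batching would have to trim or pad accordingly.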