Generalizable Person Re-Identification via Viewpoint Alignment and Fusion
This work addresses a domain generalization issue in person re-identification for surveillance and security applications, offering an incremental improvement by focusing on viewpoint alignment.
The paper tackles poor generalization in person re-identification caused by unpredictable camera viewpoint changes. It uses 3D dense pose estimation and texture mapping to create canonical view images, then fuses them with the original images via a transformer-based module to compensate for lost details. In experiments, the method achieves superior performance over existing approaches.
Most domain generalization work in person re-identification (ReID) focuses on style differences between domains and largely ignores unpredictable camera view changes, which we identify as another major cause of poor generalization in ReID methods. To tackle viewpoint change, this work proposes using a 3D dense pose estimation model and a texture mapping module to map pedestrian images to canonical view images. Because the texture mapping module is imperfect, the canonical view images may lose discriminative details from the original images, so using them directly for ReID inevitably results in poor performance. To handle this issue, we propose to fuse the original image and the canonical view image via a transformer-based module. The key insight of this design is that the cross-attention mechanism in the transformer could be an ideal way to align the discriminative texture clues from the original image with the canonical view image, compensating for the low-quality texture information of the canonical view image. Through extensive experiments, we show that our method achieves superior performance over existing approaches in various evaluation settings.
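The fusion idea can be illustrated with a minimal single-head cross-attention sketch in NumPy. This is not the authors' implementation: the abstract does not say which branch supplies the queries, so the sketch assumes that canonical-view tokens act as queries and original-image tokens act as keys and values. The function name `cross_attention_fuse`, the token counts, the feature size, the residual connection, and the random projection matrices are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention_fuse(canonical_tokens, original_tokens, w_q, w_k, w_v):
    """Fuse two token sets with single-head cross-attention.

    Assumption (not stated in the abstract): canonical-view tokens are the
    queries, original-image tokens are the keys and values.
    canonical_tokens: (n_canonical, d), original_tokens: (n_original, d).
    """
    q = canonical_tokens @ w_q            # (n_canonical, d)
    k = original_tokens @ w_k             # (n_original, d)
    v = original_tokens @ w_v             # (n_original, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)       # each row sums to 1
    # Residual connection: keep the canonical features and add the
    # texture information gathered from the original image.
    fused = canonical_tokens + attn @ v
    return fused, attn


# Toy inputs standing in for patch embeddings of the two images.
rng = np.random.default_rng(0)
d = 16
canonical = rng.normal(size=(6, d))   # 6 hypothetical canonical-view tokens
original = rng.normal(size=(8, d))    # 8 hypothetical original-image tokens
w_q, w_k, w_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

fused, attn = cross_attention_fuse(canonical, original, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

Each row of `attn` describes how strongly one canonical-view token draws on every original-image token, which is the sense in which cross-attention can align texture clues across the two views. A full transformer module would add multiple heads, layer normalization, and feed-forward layers.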