CV · Oct 31, 2025

Vision Transformer for Robust Occluded Person Reidentification in Complex Surveillance Scenes

Bo Li, Duyuan Zheng, Xinyang Liu, Qingwen Li, Hong Li, Hongyan Cui, Ge Gao, Chen Liu

arXiv:2510.27677v1 · 3.6 · h-index: 5

Originality: Highly original

AI Analysis

This paper addresses robust person re-identification under occlusion and poor imaging conditions in surveillance applications, and reports a strong gain on that specific task.

The paper tackles occluded person re-identification in surveillance by proposing Sh-ViT, a lightweight Vision Transformer model. It achieves 83.2% Rank-1 and 80.1% mAP on the new MyTT dataset and 94.6% Rank-1 and 87.5% mAP on Market1501, which the authors report as surpassing state-of-the-art methods.

Person re-identification (ReID) in surveillance is challenged by occlusion, viewpoint distortion, and poor image quality. Most existing methods rely on complex modules or perform well only on clear frontal images. We propose Sh-ViT (Shuffling Vision Transformer), a lightweight and robust model for occluded person ReID. Built on ViT-Base, Sh-ViT introduces three components: first, a Shuffle module in the final Transformer layer to break spatial correlations and enhance robustness to occlusion and blur; second, scenario-adapted augmentation (geometric transforms, erasing, blur, and color adjustment) to simulate surveillance conditions; third, DeiT-based knowledge distillation to improve learning with limited labels. To support real-world evaluation, we construct the MyTT dataset, containing over 10,000 pedestrians and 30,000+ images from base station inspections, with frequent equipment occlusion and camera variations. Experiments show that Sh-ViT achieves 83.2% Rank-1 and 80.1% mAP on MyTT, outperforming CNN and ViT baselines, and 94.6% Rank-1 and 87.5% mAP on Market1501, surpassing state-of-the-art methods. In summary, Sh-ViT improves robustness to occlusion and blur without external modules, offering a practical solution for surveillance-based personnel monitoring.
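
The abstract reports results as Rank-1 and mAP. As a reading aid, the sketch below shows how these two retrieval metrics are commonly computed from a query-by-gallery distance matrix. It is not the authors' evaluation code: the function name `rank1_and_map` and its arguments are hypothetical, and the sketch omits the same-camera filtering that standard Market1501 evaluation applies.

```python
from typing import List, Sequence, Tuple


def rank1_and_map(
    distances: Sequence[Sequence[float]],
    query_ids: Sequence[int],
    gallery_ids: Sequence[int],
) -> Tuple[float, float]:
    """Return (Rank-1, mAP) for a query-by-gallery distance matrix.

    Assumption: distances[q][g] is smaller when query q and gallery
    image g look more alike. Camera-based filtering is not applied.
    """
    rank1_hits = 0
    average_precisions: List[float] = []

    for q, row in enumerate(distances):
        # Rank gallery images from closest to farthest.
        order = sorted(range(len(row)), key=lambda g: row[g])
        matches = [gallery_ids[g] == query_ids[q] for g in order]
        if not any(matches):
            continue  # a query with no true match cannot be scored

        # Rank-1: is the closest gallery image the same identity?
        rank1_hits += int(matches[0])

        # Average precision: mean of precision at each true match.
        hits = 0
        precision_sum = 0.0
        for rank, is_match in enumerate(matches, start=1):
            if is_match:
                hits += 1
                precision_sum += hits / rank
        average_precisions.append(precision_sum / hits)

    scored = len(average_precisions)
    if scored == 0:
        raise ValueError("no query has a matching gallery identity")
    return rank1_hits / scored, sum(average_precisions) / scored
```

Rank-1 only rewards the single closest gallery image, while mAP also penalizes correct matches that appear far down the ranking, which is why the two figures can differ noticeably on the same dataset.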
