CV · Nov 9, 2018

STA: Spatial-Temporal Attention for Large-Scale Video-based Person Re-Identification

arXiv:1811.04129v1 · 233 citations
Originality: Incremental advance
AI Analysis

This work addresses the problem of identifying individuals across video frames for surveillance and security applications, representing an incremental improvement over existing methods.

The authors tackled video-based person re-identification by proposing a Spatial-Temporal Attention (STA) approach to generate robust clip-level feature representations, achieving an mAP of 87.7% on the MARS benchmark, which outperforms state-of-the-art methods by over 11.6%.

In this work, we propose a novel Spatial-Temporal Attention (STA) approach to tackle the large-scale person re-identification task in videos. Unlike most existing methods, which simply compute representations of video clips using frame-level aggregation (e.g. average pooling), the proposed STA produces a more robust clip-level feature representation. Concretely, STA fully exploits the discriminative parts of a target person in both the spatial and temporal dimensions, producing, via inter-frame regularization, a 2-D attention score matrix that measures the importance of each spatial part across frames. A more robust clip-level feature representation is then generated by a weighted sum guided by this 2-D attention score matrix. In this way, STA handles challenging cases for video-based person re-identification such as pose variation and partial occlusion. We conduct extensive experiments on two large-scale benchmarks, MARS and DukeMTMC-VideoReID. In particular, the mAP reaches 87.7% on MARS, outperforming the state of the art by a large margin of more than 11.6%.
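The weighted-sum idea in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not specify how the scores are computed, so everything here is an assumption made for illustration: the function name `clip_feature`, the split into `num_parts` horizontal stripes, the use of squared activation energy as the raw score, and normalization across frames standing in for the inter-frame regularization. It is not the authors' implementation.

```python
import numpy as np

def clip_feature(feature_maps, num_parts=4):
    """Fuse frame-level feature maps of shape (N, C, H, W) into one clip feature.

    Hypothetical sketch: scores are activation energies per (frame, part),
    normalized across frames; this choice is an assumption, not the paper's code.
    """
    n, c, h, w = feature_maps.shape
    assert h % num_parts == 0, "height must divide evenly into parts"

    # Split the height axis into horizontal stripes: (N, C, K, H/K, W).
    parts = feature_maps.reshape(n, c, num_parts, h // num_parts, w)

    # Part-level features by average pooling inside each stripe: (N, K, C).
    part_feats = parts.mean(axis=(3, 4)).transpose(0, 2, 1)

    # Raw 2-D score matrix (frames x parts): assumed squared activation energy.
    raw = np.square(parts).sum(axis=(1, 3, 4))

    # Normalize across frames so each part's scores sum to 1 over the clip.
    scores = raw / (raw.sum(axis=0, keepdims=True) + 1e-12)

    # Weighted sum over frames for each part, then concatenate parts: (K * C,).
    fused = (scores[:, :, None] * part_feats).sum(axis=0)
    return fused.reshape(-1), scores

# Example: a clip of 8 frames with 16-channel, 8x4 feature maps.
rng = np.random.default_rng(0)
clip = rng.random((8, 16, 8, 4))
feature, attention = clip_feature(clip)
```

With uniform scores this reduces to plain average pooling over frames, which is the baseline the abstract contrasts against; non-uniform scores let a frame where a part is occluded contribute less to that part's feature.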
