A Symbolic Temporal Pooling method for Video-based Person Re-Identification
This work addresses a specific bottleneck in video-based person re-identification for surveillance and security applications, offering an incremental improvement over existing methods.
The paper tackles the loss of discriminating information caused by max/avg pooling in video-based person re-identification. It introduces a symbolic temporal pooling method that represents frame-level features using empirical cumulative distribution functions, and reports consistent performance improvements across four datasets.
In video-based person re-identification, spatial and temporal features are known to provide orthogonal cues for effective representations. Such representations are typically obtained by aggregating the frame-level features using max/avg pooling at different points in the model. However, these operations also reduce the amount of discriminating information available, which is particularly harmful when the classes are poorly separable. To alleviate this problem, this paper introduces a symbolic temporal pooling method, in which frame-level features are represented in a distribution-valued symbolic form obtained by fitting an Empirical Cumulative Distribution Function (ECDF) to each feature. Because the original triplet loss formulation cannot be applied directly to this kind of representation, we also introduce a symbolic triplet loss function that infers the similarity between two symbolic objects. An extensive empirical evaluation of the proposed solution against the state of the art on four well-known datasets (MARS, iLIDS-VID, PRID2011 and P-DESTRE) shows consistent improvements in performance over the previous best-performing techniques.
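As a rough illustration of the idea, the Python sketch below pools a sequence of frame-level features into a per-feature ECDF summary and plugs a distance between such summaries into a triplet-style loss. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names (`ecdf_value`, `ecdf_pool`, `symbolic_distance`, `symbolic_triplet_loss`), the fixed quantile grid, the margin value, and the choice of a mean absolute quantile difference (a Wasserstein-style distance) are all hypothetical stand-ins, since the abstract does not specify the actual symbolic similarity measure.

```python
import numpy as np


def ecdf_value(values, x):
    """Evaluate the ECDF of a 1-D sample at x: fraction of values <= x."""
    sorted_values = np.sort(np.asarray(values, dtype=float))
    return np.searchsorted(sorted_values, x, side="right") / sorted_values.size


def ecdf_pool(frames, n_quantiles=8):
    """Summarise a (T, D) sequence of frame features as a (D, n_quantiles) array.

    Assumption: each feature's ECDF is stored through its quantiles on a
    fixed probability grid, so sequences of different lengths T become
    comparable objects of the same shape.
    """
    frames = np.asarray(frames, dtype=float)
    probs = (np.arange(n_quantiles) + 0.5) / n_quantiles
    # np.quantile returns shape (n_quantiles, D); transpose to (D, n_quantiles).
    return np.quantile(frames, probs, axis=0).T


def symbolic_distance(a, b):
    """Mean absolute difference between quantile summaries.

    Assumption: a Wasserstein-style stand-in, not the paper's measure.
    """
    return float(np.mean(np.abs(a - b)))


def symbolic_triplet_loss(anchor, positive, negative, margin=0.3):
    """Standard triplet hinge, with the distance taken between pooled ECDFs."""
    d_pos = symbolic_distance(anchor, positive)
    d_neg = symbolic_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


if __name__ == "__main__":
    # Two one-feature sequences with identical max and mean over time.
    seq_a = np.array([[0.0], [0.0], [1.0], [1.0]])
    seq_b = np.array([[0.0], [0.5], [0.5], [1.0]])
    print(seq_a.max(axis=0) == seq_b.max(axis=0))    # max pooling cannot tell them apart
    print(seq_a.mean(axis=0) == seq_b.mean(axis=0))  # neither can average pooling
    print(symbolic_distance(ecdf_pool(seq_a), ecdf_pool(seq_b)) > 0)  # the ECDF summary can
```

The two toy sequences share the same maximum and mean, so max/avg pooling maps them to the same value, while their quantile summaries differ. This is the kind of information loss the symbolic representation is meant to avoid.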