CV AI LG | Dec 19, 2023

Transformer Network for Multi-Person Tracking and Re-Identification in Unconstrained Environment

Hamza Mukhtar, Muhammad Usman Ghani Khan

arXiv:2312.11929v1 | 1.5 | h-index: 6

Originality: Highly original

AI Analysis

This work addresses tracking challenges such as occlusions and non-uniform movements, with applications in surveillance and robotics. It proposes a new method for a known bottleneck in multi-object tracking.

The paper tackles the problem of multi-object tracking in unconstrained environments by proposing STMMOT, an end-to-end framework that integrates object detection and identity linkage, achieving state-of-the-art performance with a MOTA of 72.3% on the MOT17 benchmark.

Multi-object tracking (MOT) has important applications in a variety of fields, including surveillance, sports analytics, self-driving, and cooperative robotics. Despite considerable advances, existing MOT methods tend to falter when faced with non-uniform movements, occlusions, and objects that disappear and reappear. To address this shortcoming, we propose an integrated MOT method that combines object detection and identity linkage within a single, end-to-end trainable framework and also enables the model to maintain object identity links over long periods of time. Our proposed model, named STMMOT, is built around four key modules: 1) candidate proposal generation, which generates object proposals via a vision-transformer encoder-decoder architecture that detects objects in each frame of the video; 2) scale variant pyramid, a progressive pyramid structure that learns the self-scale and cross-scale similarities in multi-scale feature maps; 3) spatio-temporal memory encoder, which extracts the essential information from the memory associated with each tracked object; and 4) spatio-temporal memory decoder, which simultaneously solves the tasks of object detection and identity association for MOT. Our system leverages a robust spatio-temporal memory module that retains extensive historical observations and effectively encodes them using an attention-based aggregator. The distinguishing feature of STMMOT is that it represents objects as dynamic query embeddings that are updated continuously, which enables the prediction of object states with attention mechanisms and eliminates the need for post-processing.
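
To illustrate the idea of a per-object memory encoded by an attention-based aggregator, here is a minimal sketch in plain Python. It is not the authors' implementation: the names `TrackMemory`, `attend`, and `max_len`, the fixed-length FIFO buffer, and the use of single-head scaled dot-product attention without learned projections are all assumptions made for illustration.

```python
import math
from collections import deque


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, memory):
    """Scaled dot-product attention of one query over stored embeddings.

    Assumption: keys and values are the raw stored embeddings
    (no learned projections), which keeps the sketch self-contained.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in memory]
    weights = softmax(scores)
    return [
        sum(w * vec[i] for w, vec in zip(weights, memory))
        for i in range(len(query))
    ]


class TrackMemory:
    """Hypothetical per-object memory holding the most recent embeddings."""

    def __init__(self, max_len=4):
        # Oldest observations are dropped once max_len is exceeded.
        self.buffer = deque(maxlen=max_len)

    def update(self, embedding):
        """Store the embedding observed for this object in the current frame."""
        self.buffer.append(list(embedding))

    def aggregate(self, query):
        """Summarise the history as an attention-weighted average.

        With an empty history the query is returned unchanged.
        """
        if not self.buffer:
            return list(query)
        return attend(query, list(self.buffer))
```

In this sketch, a tracker would call `update` once per frame for each tracked object and use the output of `aggregate` as that object's refreshed query embedding for the next frame. Stored observations that resemble the current query receive higher attention weights, so they contribute more to the summary.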
