CV · AI · Oct 17, 2022

Track Targets by Dense Spatio-Temporal Position Encoding

CMU
arXiv:2210.09455v117 citations · h-index: 46
Originality: Highly original
AI Analysis

This paper addresses the problem of associating targets across frames in video tracking, and represents an incremental advance built on a new position encoding method.

The paper tackles target tracking in videos by introducing Dense Spatio-Temporal (DST) position encoding, a novel paradigm that encodes pixel-wise spatio-temporal position information using transformers, resulting in improved performance on MOT and MOTS datasets.

In this work, we propose a novel paradigm for encoding the position of targets for transformer-based target tracking in videos. The proposed paradigm, Dense Spatio-Temporal (DST) position encoding, encodes spatio-temporal position information in a pixel-wise dense fashion. This encoding supplies location information for associating targets across frames, going beyond appearance matching that compares the objects in two bounding boxes. Compared to the typical transformer positional encoding, our encoding is applied to the 2D CNN features instead of the projected feature vectors, which avoids losing positional information. Moreover, the DST encoding represents both the location of an object in a single frame and the evolution of a trajectory's location across frames in a uniform way. Integrating the DST encoding, we build a transformer-based multi-object tracking model. The model takes a video clip as input and associates targets within the clip. It can also perform online inference by associating existing trajectories with objects from incoming frames. Experiments on video multi-object tracking (MOT) and multi-object tracking and segmentation (MOTS) datasets demonstrate the effectiveness of the proposed DST position encoding.
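
To make the idea of a pixel-wise dense encoding on 2D features more concrete, here is a minimal sketch. It is not the paper's formulation, which this summary does not give. It assumes the standard sine/cosine positional encoding and applies it per cell of an object's feature grid, using the image-space (x, y) of each cell plus the frame index t. The names `sinusoid`, `dense_st_encoding`, and `add_encoding`, as well as the grid size and channel counts, are hypothetical choices for illustration.

```python
import math


def sinusoid(pos, dim, base=10000.0):
    """Standard sine/cosine encoding of a scalar position into `dim` values.

    Assumption: `dim` is even. This is the generic transformer encoding,
    swapped in here because the summary does not state the paper's formula.
    """
    out = []
    for i in range(dim // 2):
        freq = 1.0 / (base ** (2 * i / dim))
        out.append(math.sin(pos * freq))
        out.append(math.cos(pos * freq))
    return out


def dense_st_encoding(box, t, grid=4, dim_per_axis=4):
    """Build a grid x grid map of position vectors for one object crop.

    box: (x0, y0, x1, y1) in image pixels; t: frame index.
    Every cell encodes the image-space centre of that cell plus the frame
    index, so the map is dense (one vector per spatial location).
    """
    x0, y0, x1, y1 = box
    enc = []
    for r in range(grid):
        y = y0 + (r + 0.5) * (y1 - y0) / grid
        row = []
        for c in range(grid):
            x = x0 + (c + 0.5) * (x1 - x0) / grid
            row.append(
                sinusoid(x, dim_per_axis)
                + sinusoid(y, dim_per_axis)
                + sinusoid(t, dim_per_axis)
            )
        enc.append(row)
    return enc


def add_encoding(features, enc):
    """Add the encoding element-wise to a 2D feature map of the same shape."""
    return [
        [[f + e for f, e in zip(fv, ev)] for fv, ev in zip(f_row, e_row)]
        for f_row, e_row in zip(features, enc)
    ]


# Usage: a placeholder 4 x 4 x 12 feature map for one detected object.
features = [[[0.0] * 12 for _ in range(4)] for _ in range(4)]
encoded = add_encoding(features, dense_st_encoding((10, 20, 50, 100), t=3))
```

Because the encoding is added while the features still have their 2D layout, each spatial location keeps its own position signal. Flattening the crop into a single projected vector first would leave only one position per object, which is the loss the abstract describes.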
