CV, RO · Mar 19, 2024

TAPTR: Tracking Any Point with Transformers as Detection

arXiv:2403.13042v1 · 51 citations · ECCV
Originality: Incremental advance
AI Analysis

This work addresses precise point tracking in videos. It is an incremental improvement that combines existing designs from object detection and optical flow models.

The paper tackles the problem of tracking any point in videos by proposing TAPTR, a framework that adapts DETR-like object detection transformers to point tracking, achieving state-of-the-art performance on various datasets with faster inference speed.

In this paper, we propose a simple and strong framework for Tracking Any Point with TRansformers (TAPTR). Based on the observation that point tracking bears a great resemblance to object detection and tracking, we borrow designs from DETR-like algorithms to address the task of TAP. In the proposed framework, in each video frame, each tracking point is represented as a point query, which consists of a positional part and a content part. As in DETR, each query (its position and content feature) is naturally updated layer by layer. Its visibility is predicted from its updated content feature. Queries belonging to the same tracking point can exchange information through self-attention along the temporal dimension. As all such operations are well-designed in DETR-like algorithms, the model is conceptually very simple. We also adopt some useful designs such as cost volume from optical flow models and develop simple designs to provide long temporal information while mitigating the feature drifting issue. Our framework achieves state-of-the-art performance on various TAP datasets with faster inference speed.
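The point-query idea in the abstract can be illustrated with a minimal, pure-Python sketch. It is not the authors' implementation: the `PointQuery` class, the identity attention projections, the linear visibility head, and all function names are assumptions made for illustration. It shows only the three operations the abstract names: a query with a positional and a content part, temporal self-attention among the queries of one tracked point, and visibility predicted from the content feature.

```python
import math


class PointQuery:
    """One tracked point in one frame (hypothetical structure for illustration)."""

    def __init__(self, position, content):
        self.position = position  # positional part: (x, y)
        self.content = content    # content part: list of floats


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_self_attention(queries):
    """Mix content features of ONE point's queries across frames.

    Assumption: scaled dot-product attention with identity projections,
    chosen only to keep the sketch short.
    """
    dim = len(queries[0].content)
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for q in queries:
        scores = [
            scale * sum(a * b for a, b in zip(q.content, k.content))
            for k in queries
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * k.content[i] for w, k in zip(weights, queries))
            for i in range(dim)
        ]
        updated.append(PointQuery(q.position, mixed))
    return updated


def refine_position(query, delta):
    """Layer-wise position update: add a predicted (dx, dy) offset."""
    x, y = query.position
    return PointQuery((x + delta[0], y + delta[1]), query.content)


def predict_visibility(query, head_weights, bias=0.0):
    """Visibility probability from the content feature (assumed linear head + sigmoid)."""
    logit = sum(w * c for w, c in zip(head_weights, query.content)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Usage: one tracked point observed in three frames.
track = [
    PointQuery((10.0, 20.0), [1.0, 0.0]),
    PointQuery((11.0, 20.5), [0.0, 1.0]),
    PointQuery((12.0, 21.0), [1.0, 1.0]),
]
mixed = temporal_self_attention(track)
visibility = [predict_visibility(q, [0.5, -0.5]) for q in mixed]
```

In a DETR-like decoder these steps would presumably repeat in every layer, with learned projections and offsets in place of the fixed values used here.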

