CV · Jun 14, 2023

TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement

arXiv:2306.08637v2 · 311 citations · h-index: 188
Originality: Highly original
AI Analysis

This work addresses precise point tracking in videos, with applications such as animation and video analysis. It represents a strong, specific gain rather than a broad paradigm shift.

The paper tracks arbitrary points on physical surfaces in videos with a two-stage model: per-frame initialization followed by temporal refinement. It reports an improvement of roughly 20 percentage points in average Jaccard (AJ) on the DAVIS portion of the TAP-Vid benchmark.
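To make the headline metric concrete, here is a minimal sketch of average Jaccard for a set of tracked points. It assumes the usual TAP-Vid convention: a prediction counts as a true positive only when the point is visible in the ground truth, predicted visible, and within a pixel threshold, and the Jaccard score is averaged over thresholds of 1, 2, 4, 8 and 16 pixels. The function name and argument layout are illustrative, not taken from the paper's code.

```python
def average_jaccard(pred_xy, pred_visible, gt_xy, gt_visible,
                    thresholds=(1, 2, 4, 8, 16)):
    """Average Jaccard over distance thresholds (illustrative sketch).

    pred_xy / gt_xy: lists of (x, y) positions, one entry per point-frame pair.
    pred_visible / gt_visible: lists of booleans aligned with the positions.
    """
    scores = []
    for thr in thresholds:
        tp = fp = fn = 0
        for (px, py), pv, (gx, gy), gv in zip(pred_xy, pred_visible,
                                              gt_xy, gt_visible):
            dist = ((px - gx) ** 2 + (py - gy) ** 2) ** 0.5
            if gv and pv and dist < thr:
                tp += 1  # visible, predicted visible, and close enough
            else:
                if pv:
                    fp += 1  # predicted visible but occluded or too far
                if gv:
                    fn += 1  # truly visible but missed or too far
        denom = tp + fp + fn
        scores.append(tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Under this definition, a single visible point predicted 3 pixels away fails the 1- and 2-pixel thresholds and passes the other three. Note that such a miss is counted as both a false positive and a false negative, so position errors are penalized twice.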

We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence. Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations. The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS. Our model facilitates fast inference on long and high-resolution video sequences. On a modern GPU, our implementation has the capacity to track points faster than real-time, and can be flexibly extended to higher-resolution videos. Given the high-quality trajectories extracted from a large dataset, we demonstrate a proof-of-concept diffusion model which generates trajectories from static images, enabling plausible animations. Visualizations, source code, and pretrained models can be found on our project webpage.
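The two stages described in the abstract can be illustrated with a deliberately simplified sketch. It is not the paper's architecture: the matching stage here uses plain dot products and a global soft-argmax on each frame, and the refinement stage is a hard argmax over a small window around the initial estimate. It does not update the query features, and nothing is learned. All names (`match_query`, `refine_track`, `temperature`, `radius`) are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def match_query(query_feat, frame_feats, temperature=1.0):
    """Stage 1 (sketch): locate a candidate match independently on each frame.

    frame_feats: list over frames; each frame is a grid (list of rows)
    of feature vectors. Returns one soft-argmax (x, y) per frame.
    """
    track = []
    for grid in frame_feats:
        scores = [[dot(query_feat, f) / temperature for f in row] for row in grid]
        top = max(max(row) for row in scores)
        # Subtract the maximum before exponentiating for numerical stability.
        weights = [[math.exp(s - top) for s in row] for row in scores]
        total = sum(sum(row) for row in weights)
        y = sum(i * w for i, row in enumerate(weights) for w in row) / total
        x = sum(j * w for row in weights for j, w in enumerate(row)) / total
        track.append((x, y))
    return track


def refine_track(query_feat, frame_feats, track, radius=1):
    """Stage 2 (sketch): re-score only a local window around each estimate."""
    refined = []
    for grid, (x0, y0) in zip(frame_feats, track):
        h, w = len(grid), len(grid[0])
        cx, cy = round(x0), round(y0)
        best_score, best_xy = None, (cx, cy)
        for y in range(max(0, cy - radius), min(h, cy + radius + 1)):
            for x in range(max(0, cx - radius), min(w, cx + radius + 1)):
                score = dot(query_feat, grid[y][x])
                if best_score is None or score > best_score:
                    best_score, best_xy = score, (x, y)
        refined.append(best_xy)
    return refined
```

Because the first stage treats every frame independently, it is easy to parallelize and lets a point be found again after an occlusion; the second stage then only needs to examine a small neighbourhood of each estimate.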

Code Implementations: 3 repos
