CV · Sep 24, 2024

Self-Supervised Any-Point Tracking by Contrastive Random Walks

arXiv:2409.16288v1 · 11 citations · h-index: 9
Originality: Incremental advance
AI Analysis

This work addresses the problem of precise point tracking in videos for computer vision applications, presenting an incremental improvement over existing self-supervised methods.

The paper tackles the Tracking Any Point (TAP) problem with a self-supervised approach: a global matching transformer trained with contrastive random walks. It achieves strong performance on the TapVid benchmarks, outperforming previous self-supervised methods such as DIFT while remaining competitive with several supervised ones.

We present a simple, self-supervised approach to the Tracking Any Point (TAP) problem. We train a global matching transformer to find cycle-consistent tracks through video via contrastive random walks, using the transformer's attention-based global matching to define the transition matrices for a random walk on a space-time graph. The ability to perform "all pairs" comparisons between points allows the model to obtain high spatial precision and a strong contrastive learning signal, while avoiding many of the complexities of recent approaches (such as coarse-to-fine matching). To do this, we propose a number of design decisions that allow global matching architectures to be trained through self-supervision using cycle consistency. For example, we identify that transformer-based methods are sensitive to shortcut solutions, and propose a data augmentation scheme to address them. Our method achieves strong performance on the TapVid benchmarks, outperforming previous self-supervised tracking methods, such as DIFT, and is competitive with several supervised methods.
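The cycle-consistency objective described in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: it assumes per-frame point features are already given and builds each transition matrix from a row-wise softmax over cosine similarities, standing in for the transformer's attention-based global matching. The function names, the `temperature` parameter, and the palindrome (forward then backward) walk are illustrative assumptions.

```python
import numpy as np


def softmax_rows(logits):
    """Numerically stable softmax over each row."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def transition_matrix(feats_a, feats_b, temperature=0.1):
    """All-pairs transition probabilities from points in frame a to frame b.

    Assumption: cosine similarity of fixed features replaces the
    transformer's global matching used in the paper.
    """
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    return softmax_rows(a @ b.T / temperature)


def cycle_consistency_loss(frame_feats, temperature=0.1):
    """Walk forward through the frames, then back, and score the return.

    frame_feats: sequence of (num_points, dim) arrays, one per frame.
    Returns the mean negative log-probability that each walker ends at
    the point where it started, plus the round-trip matrix itself.
    """
    num_points = frame_feats[0].shape[0]
    # Palindrome ordering: frames 0..T-1 followed by T-2..0.
    palindrome = list(frame_feats) + list(frame_feats[-2::-1])
    walk = np.eye(num_points)
    for src, dst in zip(palindrome[:-1], palindrome[1:]):
        walk = walk @ transition_matrix(src, dst, temperature)
    # Cross-entropy against the identity: only the diagonal matters.
    return_prob = np.clip(np.diag(walk), 1e-12, None)
    return float(-np.log(return_prob).mean()), walk
```

Under these assumptions, features that are distinct within a frame and stable across frames give a loss near zero, while unrelated features across frames give a large loss. In training, that loss would be backpropagated through the matching model that produces the transition matrices.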

Code Implementations: 1 repo
