Repurposing Video Diffusion Transformers for Robust Point Tracking
The work addresses the need for more reliable point tracking in applications such as 4D reconstruction, robotics, and video editing. It is an incremental improvement that adapts existing video DiTs.
The paper tackles point tracking in videos by repurposing pre-trained video Diffusion Transformers (DiTs) to improve temporal coherence and robustness under challenging conditions such as dynamic motions and occlusions. The resulting model, DiTracker, achieves state-of-the-art performance on the ITTO benchmark and matches or outperforms state-of-the-art models on the TAP-Vid benchmarks while training with an 8 times smaller batch size.
Point tracking aims to localize corresponding points across video frames, serving as a fundamental task for 4D reconstruction, robotics, and video editing. Existing methods commonly rely on shallow convolutional backbones such as ResNet that process frames independently; they therefore lack temporal coherence and produce unreliable matching costs under challenging conditions. Through systematic analysis, we find that video Diffusion Transformers (DiTs), pre-trained on large-scale real-world videos with spatio-temporal attention, inherently exhibit strong point tracking capability and robustly handle dynamic motions and frequent occlusions. We propose DiTracker, which adapts video DiTs through: (1) query-key attention matching, (2) lightweight LoRA tuning, and (3) cost fusion with a ResNet backbone. Despite training with an 8 times smaller batch size, DiTracker achieves state-of-the-art performance on the challenging ITTO benchmark and matches or outperforms state-of-the-art models on the TAP-Vid benchmarks. Our work validates video DiT features as an effective and efficient foundation for point tracking.
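As an illustration only, query-key attention matching and cost fusion can be sketched on toy data. The snippet below is not the DiTracker implementation. It assumes the matching cost is a softmax over scaled dot products between one query vector and a grid of key vectors, and it assumes a simple convex combination as the fusion rule; the abstract does not specify either detail. The function names `attention_matching_cost`, `fuse_costs`, and `argmax_location`, the `weight` parameter, and the uniform stand-in for the ResNet cost map are all hypothetical.

```python
import math

def attention_matching_cost(query, keys):
    """Softmax of scaled dot products between one query vector and a grid of keys.

    query: list of C floats (query-projected feature of the tracked point).
    keys:  H x W x C nested list (key-projected features of a target frame).
    Returns an H x W nested list of weights that sum to 1.
    """
    scale = 1.0 / math.sqrt(len(query))
    logits = [[scale * sum(q * k for q, k in zip(query, key)) for key in row]
              for row in keys]
    peak = max(max(row) for row in logits)  # subtract the max for numerical stability
    exps = [[math.exp(v - peak) for v in row] for row in logits]
    total = sum(sum(row) for row in exps)
    return [[v / total for v in row] for row in exps]

def fuse_costs(cost_a, cost_b, weight=0.5):
    """Assumed fusion rule: convex combination of two same-sized cost maps."""
    return [[weight * a + (1.0 - weight) * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(cost_a, cost_b)]

def argmax_location(cost):
    """Return the (row, col) index of the largest entry in a cost map."""
    return max(((r, c) for r, row in enumerate(cost) for c in range(len(row))),
               key=lambda rc: cost[rc[0]][rc[1]])

# Toy example: a 3x4 frame with 2-dimensional features.
keys = [[[0.0, 0.0] for _ in range(4)] for _ in range(3)]
keys[1][2] = [1.0, 1.0]  # the only position aligned with the query
query = [1.0, 1.0]

dit_cost = attention_matching_cost(query, keys)
resnet_cost = [[1.0 / 12] * 4 for _ in range(3)]  # stand-in: uninformative uniform map
fused = fuse_costs(dit_cost, resnet_cost)
print(argmax_location(fused))
```

In this toy setup the fused map still peaks at the position whose key aligns with the query, because the uniform stand-in adds the same value everywhere. A learned fusion could weight the two sources differently per location.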