It's a Matter of Time: Three Lessons on Long-Term Motion for Perception
This work provides foundational insights for designing models that leverage temporal data, potentially impacting various perceptual tasks in computer vision.
The paper investigates the role of long-term motion information in perception and finds that motion representations often outperform image representations in understanding actions, objects, materials, and spatial information, generalize better in low-data and zero-shot settings, and offer a better trade-off between computational cost (GFLOPs) and accuracy than standard video representations.
Temporal information has long been considered essential for perception. While there is extensive research on the role of image information in perceptual tasks, the role of the temporal dimension remains less well understood: What can we learn about the world from long-term motion information? What properties does long-term motion information have for visual learning? We leverage recent success in point-track estimation, which offers an excellent opportunity to learn temporal representations and experiment on a variety of perceptual tasks. We draw three clear lessons: 1) Long-term motion representations contain information for understanding not only actions, but also objects, materials, and spatial information, often even better than images. 2) Long-term motion representations generalize far better than image representations in low-data settings and in zero-shot tasks. 3) The very low dimensionality of motion information makes motion representations a better trade-off between GFLOPs and accuracy than standard video representations, and combining the two achieves higher performance than video representations alone. We hope these insights will pave the way for the design of future models that leverage the power of long-term motion information for perception.
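To give a rough sense of the "very low dimensionality" mentioned in the third lesson, the sketch below counts the raw input values in an RGB video clip and in a set of 2D point tracks over the same frames. The clip length, resolution, and number of tracked points are illustrative assumptions and are not taken from the paper, as are the function names. The comparison covers input size only and says nothing about the GFLOPs of any particular model.

```python
# Illustrative input-size comparison between a video clip and point tracks.
# All numeric settings are assumptions chosen for illustration,
# not values reported in the paper.

def video_clip_size(frames: int, height: int, width: int, channels: int = 3) -> int:
    """Number of scalar values in a dense video clip (frames x H x W x C)."""
    return frames * height * width * channels


def point_track_size(frames: int, num_points: int, coords: int = 2) -> int:
    """Number of scalar values in point tracks (frames x points x (x, y))."""
    return frames * num_points * coords


# Hypothetical settings: 16 frames at 224x224 RGB, 256 tracked points.
video_values = video_clip_size(frames=16, height=224, width=224)
track_values = point_track_size(frames=16, num_points=256)

print(f"video clip values:  {video_values}")
print(f"point track values: {track_values}")
print(f"ratio (video / tracks): {video_values / track_values:.1f}")
```

Under these assumed settings the clip holds 2,408,448 values and the tracks hold 8,192, a difference of about two orders of magnitude in raw input size.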