CVNov 26, 2025

Seeing without Pixels: Perception from Camera Trajectories

arXiv:2511.21681v12 citationsh-index: 43
Originality Highly original
AI Analysis

This introduces a novel, lightweight modality for video perception, potentially benefiting fields like robotics and surveillance.

The paper tackles the problem of perceiving video content without pixel data by using only camera trajectories, and finds that camera movement is highly informative for tasks like cross-modal alignment and classification.

Can one perceive a video's content without seeing its pixels, just from the camera trajectory-the path it carves through space? This paper is the first to systematically investigate this seemingly implausible question. Towards this end, we propose a contrastive learning framework to train CamFormer, a dedicated encoder that projects camera pose trajectories into a joint embedding space, aligning them with natural language. We find that, contrary to its apparent simplicity, the camera trajectory is a remarkably informative signal to uncover video content. In other words, "how you move" can indeed reveal "what you are doing" (egocentric) or "observing" (exocentric). We demonstrate the versatility of our learned CamFormer embeddings on a diverse suite of downstream tasks, ranging from cross-modal alignment to classification and temporal analysis. Importantly, our representations are robust across diverse camera pose estimation methods, including both high-fidelity multi-sensored and standard RGB-only estimators. Our findings establish camera trajectory as a lightweight, robust, and versatile modality for perceiving video content.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes