CV | AI | Jul 14, 2025

Cameras as Relative Positional Encoding

arXiv:2507.10496v2 | 57 citations | h-index: 54
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of grounding visual tokens in 3D space for multi-view computer vision. It offers a method that strengthens geometric conditioning in transformers, though the advance is incremental relative to existing camera encoding techniques.

The paper tackles the problem of incorporating camera geometry into multi-view transformers for 3D perception tasks. It proposes Projective Positional Encoding (PRoPE) and shows that it improves feedforward novel view synthesis, with the gains carrying over to varied settings and to other tasks such as stereo depth estimation.

Transformers are increasingly prevalent for multi-view computer vision tasks, where geometric relationships between viewpoints are critical for 3D perception. To leverage these relationships, multi-view transformers must use camera geometry to ground visual tokens in 3D space. In this work, we compare techniques for conditioning transformers on cameras: token-level raymap encodings, attention-level relative pose encodings, and a new relative encoding we propose -- Projective Positional Encoding (PRoPE) -- that captures complete camera frustums, both intrinsics and extrinsics, as a relative positional encoding. Our experiments begin by showing how relative camera conditioning improves performance in feedforward novel view synthesis, with further gains from PRoPE. This holds across settings: scenes with both shared and varying intrinsics, when combining token- and attention-level conditioning, and for generalization to inputs with out-of-distribution sequence lengths and camera intrinsics. We then verify that these benefits persist for different tasks, stereo depth estimation and discriminative spatial cognition, as well as larger model sizes.
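
The abstract does not spell out how a relative encoding over full camera frustums is computed, so the following is only a minimal sketch under stated assumptions, not the paper's implementation. It assumes each camera is summarized by a 4x4 matrix formed from its intrinsics and its world-to-camera extrinsics, and that the relative quantity between cameras i and j is the product of one such matrix with the inverse of the other. All function names here are hypothetical. The point it illustrates is why a relative encoding is attractive: the result does not change when the arbitrary world coordinate frame changes.

```python
import numpy as np


def lift_intrinsics(K):
    """Embed a 3x3 intrinsics matrix into a 4x4 homogeneous matrix."""
    K4 = np.eye(4)
    K4[:3, :3] = K
    return K4


def projective_matrix(K, T_world_to_cam):
    """Assumed per-camera matrix: lifted intrinsics times world-to-camera pose."""
    return lift_intrinsics(K) @ T_world_to_cam


def relative_projective(K_i, T_i, K_j, T_j):
    """Assumed relative transform between two cameras: P_i @ inv(P_j)."""
    P_i = projective_matrix(K_i, T_i)
    P_j = projective_matrix(K_j, T_j)
    return P_i @ np.linalg.inv(P_j)


def make_pose(yaw, translation):
    """Hypothetical helper: rigid 4x4 transform from a yaw angle and a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]
    T[:3, 3] = translation
    return T


# Two cameras with different focal lengths (varying intrinsics).
K_a = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
K_b = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
T_a = make_pose(0.1, [0.0, 0.0, 1.0])
T_b = make_pose(-0.3, [0.5, 0.0, 2.0])

rel = relative_projective(K_a, T_a, K_b, T_b)

# Re-express both cameras in a different world frame G: the new
# world-to-camera matrices are T @ inv(G). The relative transform
# should be unaffected because G cancels in P_i @ inv(P_j).
G_inv = np.linalg.inv(make_pose(1.2, [3.0, -1.0, 4.0]))
rel_moved = relative_projective(K_a, T_a @ G_inv, K_b, T_b @ G_inv)

frame_invariant = np.allclose(rel, rel_moved)
```

In a transformer, a matrix like `rel` would presumably modulate attention between tokens of views i and j; how that is done is left out here because the abstract does not describe it.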

