CV · Dec 13, 2025

Audio-Visual Camera Pose Estimation with Passive Scene Sounds and In-the-Wild Video

arXiv:2512.12165v2
Originality: Incremental advance
AI Analysis

This work addresses camera motion estimation, a core problem in embodied perception and 3D scene understanding. Its audio-visual approach is an incremental advance over visual methods, but one that proves effective on real-world videos.

The paper tackles the problem of camera pose estimation in visually degraded conditions by leveraging passive scene sounds as complementary cues, achieving consistent gains over visual baselines on two large datasets and demonstrating robustness when visual information is corrupted.

Understanding camera motion is a fundamental problem in embodied perception and 3D scene understanding. While visual methods have advanced rapidly, they often struggle under visually degraded conditions such as motion blur or occlusions. In this work, we show that passive scene sounds provide complementary cues for relative camera pose estimation for in-the-wild videos. We introduce a simple but effective audio-visual framework that integrates direction-of-arrival (DOA) spectra and binauralized embeddings into a state-of-the-art vision-only pose estimation model. Our results on two large datasets show consistent gains over strong visual baselines, plus robustness when the visual information is corrupted. To our knowledge, this represents the first work to successfully leverage audio for relative camera pose estimation in real-world videos, and it establishes incidental, everyday audio as an unexpected but promising signal for a classic spatial challenge. Project: http://vision.cs.utexas.edu/projects/av_camera_pose.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
