CV · Sep 18, 2025

RGB-Only Supervised Camera Parameter Optimization in Dynamic Scenes

arXiv:2509.15123v2 · h-index: 5
Originality: Incremental advance
AI Analysis

The paper addresses camera parameter estimation for computer vision researchers and practitioners, enabling dynamic scene reconstruction without hard-to-obtain ground truth data. It is an incremental improvement over prior work.

The paper tackles camera parameter optimization in dynamic scenes, which traditionally requires extra supervision such as motion masks or depth. It proposes ROS-Cam, a method that takes only a single RGB video as input and reports more accurate and efficient results than existing methods.

Although COLMAP has long remained the predominant method for camera parameter optimization in static scenes, it is constrained by its lengthy runtime and reliance on ground truth (GT) motion masks for application to dynamic scenes. Many efforts attempted to improve it by incorporating more priors as supervision such as GT focal length, motion masks, 3D point clouds, camera poses, and metric depth, which, however, are typically unavailable in casually captured RGB videos. In this paper, we propose a novel method for more accurate and efficient camera parameter optimization in dynamic scenes solely supervised by a single RGB video, dubbed ROS-Cam. Our method consists of three key components: (1) Patch-wise Tracking Filters, to establish robust and maximally sparse hinge-like relations across the RGB video. (2) Outlier-aware Joint Optimization, for efficient camera parameter optimization by adaptive down-weighting of moving outliers, without reliance on motion priors. (3) A Two-stage Optimization Strategy, to enhance stability and optimization speed by a trade-off between the Softplus limits and convex minima in losses. We visually and numerically evaluate our camera estimates. To further validate accuracy, we feed the camera estimates into a 4D reconstruction method and assess the resulting 3D scenes, and rendered 2D RGB and depth maps. We perform experiments on 4 real-world datasets (NeRF-DS, DAVIS, iPhone, and TUM-dynamics) and 1 synthetic dataset (MPI-Sintel), demonstrating that our method estimates camera parameters more efficiently and accurately with a single RGB video as the only supervision.
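The idea behind component (2), down-weighting moving outliers without a motion mask, can be illustrated with a toy example. The sketch below is not the ROS-Cam implementation: it swaps in a standard technique (iteratively reweighted least squares with Cauchy weights) and a deliberately simple problem, estimating one 2D translation from point correspondences in which some points move independently. The function name `robust_translation`, the `scale` parameter, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def robust_translation(src, dst, scale=1.0, iters=20):
    """Estimate a 2D translation t with dst ≈ src + t, down-weighting outliers.

    Hypothetical helper for illustration only; uses IRLS with Cauchy weights.
    """
    diff = dst - src                       # per-point displacement, shape (N, 2)
    t = np.median(diff, axis=0)            # robust starting estimate
    weights = np.ones(len(diff))
    for _ in range(iters):
        # Residual of each correspondence under the current estimate.
        residuals = np.linalg.norm(diff - t, axis=1)
        # Cauchy weights: close to 1 for small residuals, near 0 for large ones.
        weights = 1.0 / (1.0 + (residuals / scale) ** 2)
        # Weighted least-squares update of the translation.
        t = (weights[:, None] * diff).sum(axis=0) / weights.sum()
    return t, weights

# Synthetic data (assumed): 200 points, the first 40 lie on a moving object.
rng = np.random.default_rng(0)
src = rng.uniform(0, 100, size=(200, 2))
true_t = np.array([3.0, -2.0])
dst = src + true_t + rng.normal(0, 0.1, size=src.shape)
dst[:40] += np.array([25.0, 10.0])         # independent object motion

t_est, w = robust_translation(src, dst)
```

Here no motion mask is supplied: the points on the moving object receive small weights purely because their residuals stay large under the dominant (camera-induced) motion. The method in the paper optimizes full camera parameters jointly rather than a single translation, so this only conveys the weighting principle.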

