CV | AI | Apr 4, 2025

Endo3R: Unified Online Reconstruction from Dynamic Monocular Endoscopic Video

arXiv:2504.03198v1 | 8 citations | h-index: 5 | MICCAI
Originality: Incremental advance
AI Analysis

This work addresses a critical problem for computer-assisted surgery by enabling real-time, accurate 3D scene reconstruction from surgical videos, though it builds incrementally on existing pairwise reconstruction models.

The paper tackles scale-consistent 3D reconstruction from monocular endoscopic videos, a task hindered by dynamic deformations and textureless surfaces. It presents Endo3R, a unified model that predicts globally aligned pointmaps, scale-consistent depths, and camera parameters online, without priors. The authors report superior performance in zero-shot depth prediction and pose estimation on the SCARED and Hamlyn datasets.

Reconstructing 3D scenes from monocular surgical videos can enhance surgeons' perception and therefore plays a vital role in various computer-assisted surgery tasks. However, achieving scale-consistent reconstruction remains an open challenge due to inherent issues in endoscopic videos, such as dynamic deformations and textureless surfaces. Despite recent advances, current methods either rely on calibration or instrument priors to estimate scale, or employ SfM-like multi-stage pipelines, leading to error accumulation and requiring offline optimization. In this paper, we present Endo3R, a unified 3D foundation model for online scale-consistent reconstruction from monocular surgical video, without any priors or extra optimization. Our model unifies the tasks by predicting globally aligned pointmaps, scale-consistent video depths, and camera parameters without any offline optimization. The core contribution of our method is extending the recent pairwise reconstruction model to long-term incremental dynamic reconstruction through an uncertainty-aware dual memory mechanism. The mechanism maintains history tokens that capture both short-term dynamics and long-term spatial consistency. Notably, to tackle the highly dynamic nature of surgical scenes, we measure the uncertainty of tokens via Sampson distance and filter out tokens with high uncertainty. To address the scarcity of endoscopic datasets with ground-truth depth and camera poses, we further devise a self-supervised mechanism with a novel dynamics-aware flow loss. Extensive experiments on the SCARED and Hamlyn datasets demonstrate superior performance in zero-shot surgical video depth prediction and camera pose estimation with online efficiency. Project page: https://wrld.github.io/Endo3R/.
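The abstract mentions measuring token uncertainty via Sampson distance and discarding high-uncertainty tokens, but gives no implementation details. The following is a minimal sketch of the general idea only, not the authors' method. It assumes each token is associated with a 2D pixel correspondence between two frames and that a fundamental matrix `F` is already available. The function names, the `threshold` value, and the toy data are hypothetical.

```python
import numpy as np


def sampson_distance(F, pts1, pts2):
    """First-order geometric error of 2D correspondences under a fundamental matrix F.

    pts1, pts2: (N, 2) arrays of matching pixel coordinates in frame 1 and frame 2.
    Returns an (N,) array; larger values mean the match violates the
    epipolar constraint x2^T F x1 = 0 more strongly.
    """
    ones = np.ones((pts1.shape[0], 1))
    x1 = np.hstack([pts1, ones])   # homogeneous points, shape (N, 3)
    x2 = np.hstack([pts2, ones])
    Fx1 = x1 @ F.T                 # row i is F @ x1_i
    Ftx2 = x2 @ F                  # row i is F^T @ x2_i
    numer = np.sum(x2 * Fx1, axis=1) ** 2
    denom = Fx1[:, 0] ** 2 + Fx1[:, 1] ** 2 + Ftx2[:, 0] ** 2 + Ftx2[:, 1] ** 2
    return numer / np.maximum(denom, 1e-12)  # guard against division by zero


def keep_low_uncertainty(F, pts1, pts2, threshold=1.0):
    """Boolean mask of correspondences to keep (threshold is an assumed value)."""
    return sampson_distance(F, pts1, pts2) < threshold


# Toy example: pure sideways camera translation, so static points keep their row.
F = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
pts1 = np.array([[10.0, 20.0], [10.0, 20.0]])
pts2 = np.array([[15.0, 20.0], [15.0, 24.0]])  # second match drifts vertically

distances = sampson_distance(F, pts1, pts2)
mask = keep_low_uncertainty(F, pts1, pts2)
print(distances, mask)
```

In this toy setup the first correspondence satisfies the epipolar constraint and is kept, while the second, which moves off its epipolar line as a deforming tissue point might, is filtered out.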

