Self-supervised Monocular Depth and Pose Estimation for Endoscopy with Generative Latent Priors
This work addresses the need for reliable depth and pose estimation in gastrointestinal endoscopy. The contribution is incremental: it builds on prior self-supervised methods and adds specific enhancements.
The paper tackles the problem of accurate 3D mapping in endoscopy by proposing a self-supervised framework for monocular depth and pose estimation, and it reports superior performance over published self-supervised methods on the SimCol and EndoSLAM endoscopic datasets.
Accurate 3D mapping in endoscopy enables quantitative, holistic lesion characterization within the gastrointestinal (GI) tract, but it requires reliable depth and pose estimation. However, endoscopy systems are monocular, and existing methods that rely on synthetic datasets or complex models often generalize poorly to challenging endoscopic conditions. We propose a robust self-supervised monocular depth and pose estimation framework that incorporates a Generative Latent Bank and a Variational Autoencoder (VAE). The Generative Latent Bank leverages extensive depth scenes from natural images to condition the depth network, enhancing the realism and robustness of depth predictions through latent feature priors. We reformulate pose estimation within a VAE framework, treating pose transitions as latent variables to regularize scale, stabilize z-axis prominence, and improve x-y sensitivity. This dual refinement pipeline enables accurate depth and pose predictions, effectively addressing the GI tract's complex textures and lighting. Extensive evaluations on the SimCol and EndoSLAM datasets confirm our framework's superior performance over published self-supervised methods in endoscopic depth and pose estimation.
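To make the idea of treating pose transitions as latent variables more concrete, the following is a minimal sketch of the two generic building blocks of any VAE: reparameterized sampling and the KL regularizer toward a standard normal prior. It is not the paper's implementation. The function names, the 6-dimensional latent (one dimension per degree of freedom of a relative pose), and the example encoder outputs are all assumptions made for illustration.

```python
import math
import random


def sample_pose_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1).

    mu and log_var are hypothetical encoder outputs for one frame pair.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


rng = random.Random(0)

# Assumed 6-D latent for a relative camera pose (illustrative values only).
mu = [0.0, 0.0, 0.1, 0.0, 0.0, 0.0]
log_var = [-2.0] * 6

z = sample_pose_latent(mu, log_var, rng)   # stochastic latent pose code
kl = kl_to_standard_normal(mu, log_var)    # regularization term
```

In a full training loop, a term of this kind would typically be weighted and added to the self-supervised reconstruction loss. How the paper weights it, and how the latent is decoded into a pose, is not stated in the abstract.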