CV · Oct 10, 2025

Online Video Depth Anything: Temporally-Consistent Depth Prediction with Low Memory Consumption

arXiv:2510.09182v1 · 1 citation · h-index: 19
Originality: Highly original
AI Analysis

oVDA enables real-time, temporally consistent depth prediction on edge devices by removing the batch-processing requirement that keeps Video Depth Anything out of online settings.

The paper addresses the batch-processing limitation of video depth estimation with an online method that borrows two techniques from LLMs: caching latent features during inference and masking frames during training. It reaches 42 FPS on an NVIDIA A100 and 20 FPS on an NVIDIA Jetson edge device, with lower VRAM usage than competing online methods.

Depth estimation from monocular video has become a key component of many real-world computer vision systems. Recently, Video Depth Anything (VDA) has demonstrated strong performance on long video sequences. However, it relies on batch processing, which prohibits its use in an online setting. In this work, we overcome this limitation and introduce online VDA (oVDA). The key innovation is to employ techniques from Large Language Models (LLMs), namely, caching latent features during inference and masking frames at training. Our oVDA method outperforms all competing online video depth estimation methods in both accuracy and VRAM usage. Low VRAM usage is particularly important for deployment on edge devices. We demonstrate that oVDA runs at 42 FPS on an NVIDIA A100 and at 20 FPS on an NVIDIA Jetson edge device. We will release both code and compilation scripts, making oVDA easy to deploy on low-power hardware.
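
The abstract names two LLM-style ingredients, masking frames at training and caching latent features at inference, without giving implementation details. The sketch below is not the oVDA architecture. It is a toy illustration, in plain Python, of why the two can fit together, under the assumption that the masking is causal (each frame sees only itself and earlier frames). The names `causal_frame_mask`, `attend`, `masked_batch`, and `LatentCache` are hypothetical, and the two-dimensional "latents" are made-up values.

```python
import math
from collections import deque


def causal_frame_mask(num_frames):
    """mask[i][j] is True when frame i may attend to frame j (only j <= i)."""
    return [[j <= i for j in range(num_frames)] for i in range(num_frames)]


def attend(query, keys, values):
    """Scaled dot-product attention of a single query over keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(len(values[0]))
    ]


def masked_batch(latents):
    """Training-style pass: all frames at once, future frames masked out."""
    mask = causal_frame_mask(len(latents))
    outputs = []
    for i, query in enumerate(latents):
        visible = [latents[j] for j in range(len(latents)) if mask[i][j]]
        outputs.append(attend(query, visible, visible))
    return outputs


class LatentCache:
    """Bounded FIFO of past per-frame latents (hypothetical helper)."""

    def __init__(self, max_frames):
        # deque(maxlen=...) silently drops the oldest frame when full,
        # so memory stays constant no matter how long the video runs.
        self.frames = deque(maxlen=max_frames)

    def step(self, latent):
        """Inference-style pass: one new frame attends to the cached context."""
        self.frames.append(latent)
        context = list(self.frames)
        return attend(latent, context, context)


# Toy per-frame latents (assumed values, for illustration only).
latents = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]

cache = LatentCache(max_frames=8)
online = [cache.step(frame) for frame in latents]  # one frame at a time
batch = masked_batch(latents)                      # whole clip, masked
```

In this toy setting, as long as the cache holds every earlier frame, the frame-by-frame outputs in `online` equal the masked whole-clip outputs in `batch`. Shrinking `max_frames` trades some temporal context for a fixed memory footprint, which is the kind of trade-off that matters on edge devices.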
