FlashDepth: Real-time Streaming Video Depth Estimation at 2K Resolution
This work addresses the need for accurate, consistent, and fast depth estimation in applications such as video editing and robotics. The contribution is incremental in that it builds on pretrained single-image depth models.
The paper tackles real-time, high-resolution video depth estimation by proposing FlashDepth, which performs depth estimation on 2044x1148 streaming video at 24 FPS. It outperforms state-of-the-art models in boundary sharpness and speed while maintaining competitive accuracy.
A versatile video depth estimation model should (1) be accurate and consistent across frames, (2) produce high-resolution depth maps, and (3) support real-time streaming. We propose FlashDepth, a method that satisfies all three requirements, performing depth estimation on a 2044x1148 streaming video at 24 FPS. We show that, with careful modifications to pretrained single-image depth models, these capabilities are enabled with relatively little data and training. We evaluate our approach across multiple unseen datasets against state-of-the-art depth models, and find that ours outperforms them in terms of boundary sharpness and speed by a significant margin, while maintaining competitive accuracy. We hope our model will enable applications that require high-resolution depth, such as video editing, as well as applications that require online decision-making, such as robotics. We release all code and model weights at https://github.com/Eyeline-Research/FlashDepth
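
As a rough illustration of what the stated operating point implies, the sketch below computes the per-frame time budget and the pixel throughput for 2044x1148 video at 24 FPS. Only the resolution and frame rate come from the text above; the constant and function names are hypothetical and are not taken from the FlashDepth code.

```python
# Back-of-the-envelope arithmetic for the stated operating point.
# WIDTH, HEIGHT, and FPS come from the abstract; all names are illustrative.
WIDTH, HEIGHT, FPS = 2044, 1148, 24


def frame_budget_ms(fps: float) -> float:
    """Maximum time available per frame, in milliseconds, to sustain `fps`."""
    return 1000.0 / fps


def pixels_per_frame(width: int, height: int) -> int:
    """Number of depth values predicted for one frame."""
    return width * height


def pixels_per_second(width: int, height: int, fps: int) -> int:
    """Number of depth values predicted per second of video."""
    return pixels_per_frame(width, height) * fps


if __name__ == "__main__":
    print(f"frame budget: {frame_budget_ms(FPS):.2f} ms")              # 41.67 ms
    print(f"pixels/frame: {pixels_per_frame(WIDTH, HEIGHT)}")          # 2346512
    print(f"pixels/second: {pixels_per_second(WIDTH, HEIGHT, FPS)}")   # 56316288
```

In other words, a streaming model at this setting has roughly 41.67 ms to process each frame of about 2.35 million pixels, with no access to future frames.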