CV AI RO | Mar 3, 2024

Kick Back & Relax++: Scaling Beyond Ground-Truth Depth with SlowTV & CribsTV

Jaime Spencer, Chris Russell, Simon Hadfield, Richard Bowden

arXiv:2403.01569v1 | 9.67 citations | h-index: 25 | Has Code

Originality: Incremental advance

AI Analysis

This paper addresses the lack of diverse training data for self-supervised depth estimation, enabling better generalization beyond urban driving scenes. Its contribution is incremental: it combines existing techniques with new data.

The paper tackles the problem of limited generalization in self-supervised monocular depth estimation by introducing two novel datasets, SlowTV and CribsTV, with 2M training frames from diverse environments, achieving zero-shot generalization that outperforms existing self-supervised and some supervised methods.

Self-supervised learning is the key to unlocking generic computer vision systems. By eliminating the reliance on ground-truth annotations, it allows scaling to much larger data quantities. Unfortunately, self-supervised monocular depth estimation (SS-MDE) has been limited by the absence of diverse training data. Existing datasets have focused exclusively on urban driving in densely populated cities, resulting in models that fail to generalize beyond this domain. To address these limitations, this paper proposes two novel datasets: SlowTV and CribsTV. These are large-scale datasets curated from publicly available YouTube videos, containing a total of 2M training frames. They offer an incredibly diverse set of environments, ranging from snowy forests to coastal roads, luxury mansions and even underwater coral reefs. We leverage these datasets to tackle the challenging task of zero-shot generalization, outperforming every existing SS-MDE approach and even some state-of-the-art supervised methods. The generalization capabilities of our models are further enhanced by a range of components and contributions: 1) learning the camera intrinsics, 2) a stronger augmentation regime targeting aspect ratio changes, 3) support frame randomization, 4) flexible motion estimation, 5) a modern transformer-based architecture. We demonstrate the effectiveness of each component in extensive ablation experiments. To facilitate the development of future research, we make the datasets, code and pretrained models available to the public at https://github.com/jspenmar/slowtv_monodepth.
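To illustrate why camera intrinsics matter in the abstract's component list, the following is a minimal sketch of the pinhole reprojection step that self-supervised depth methods of this kind typically rely on for view synthesis. It is not the authors' implementation: the function names (`backproject`, `project`, `warp_pixel`) and the example values are hypothetical, and a real model would predict the depth, the intrinsics and the relative motion per image rather than take them as constants.

```python
# Illustrative sketch only: names and values are assumptions, not the paper's code.

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with a depth value to a 3D point in the camera frame."""
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)


def project(point, fx, fy, cx, cy):
    """Project a 3D point in the camera frame back to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def warp_pixel(u, v, depth, intrinsics, rotation, translation):
    """Find where a target-frame pixel lands in a support frame.

    intrinsics:  (fx, fy, cx, cy), assumed shared by both frames.
    rotation:    3x3 nested list, target-to-support rotation.
    translation: length-3 sequence, target-to-support translation.
    """
    fx, fy, cx, cy = intrinsics
    p = backproject(u, v, depth, fx, fy, cx, cy)
    # Apply the rigid motion q = R p + t.
    q = tuple(
        sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    return project(q, fx, fy, cx, cy)


if __name__ == "__main__":
    K = (500.0, 500.0, 320.0, 240.0)  # hypothetical intrinsics
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    # A small sideways camera shift moves the principal-point pixel horizontally.
    print(warp_pixel(320.0, 240.0, 2.0, K, identity, (0.1, 0.0, 0.0)))
```

Because the warp depends on `fx`, `fy`, `cx` and `cy`, uncalibrated video such as YouTube footage cannot be used this way unless the intrinsics are known or, as the abstract describes, learned alongside depth and motion.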

