DnD: Dense Depth Estimation in Crowded Dynamic Indoor Scenes
The work addresses depth estimation in challenging environments such as stores and stations, but the contribution is incremental: it builds on existing methods and adds specific constraints.
The paper estimates dense depth maps from monocular video in crowded, dynamic indoor scenes and reports consistent improvements over recent methods on the NAVERLABS dataset.
We present a novel approach for estimating depth from a monocular camera as it moves through complex and crowded indoor environments, e.g., a department store or a metro station. Our approach predicts absolute-scale depth maps over the entire scene, consisting of a static background and multiple moving people, by training on dynamic scenes. Since it is difficult to collect dense depth maps from crowded indoor environments, we design our training framework so that it does not require depths produced by depth-sensing devices. Our network leverages RGB images and sparse depth maps generated by traditional 3D reconstruction methods to estimate dense depth maps. We use two constraints to handle depth for non-rigidly moving people without tracking their motion explicitly. We demonstrate that our approach offers consistent improvements over recent depth estimation methods on the NAVERLABS dataset, which includes complex and crowded scenes.
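To make the idea of training from sparse depth concrete, the following is a minimal sketch of one common way to use sparse depth as supervision: a masked L1 loss computed only at pixels where the reconstruction provides a depth value. It is not the authors' loss, and the two constraints for moving people mentioned in the abstract are not modeled here. The function name `sparse_depth_l1` and the convention that 0 marks a pixel with no reconstructed depth are assumptions made for illustration.

```python
def sparse_depth_l1(pred_depth, sparse_depth):
    """Mean absolute error over pixels where sparse_depth holds a valid (> 0) value.

    Both arguments are 2D row-major lists of depths in metres. A 0 in
    sparse_depth marks a pixel with no reconstructed 3D point (assumed convention).
    """
    if len(pred_depth) != len(sparse_depth):
        raise ValueError("row count mismatch")
    total, n_valid = 0.0, 0
    for pred_row, sparse_row in zip(pred_depth, sparse_depth):
        if len(pred_row) != len(sparse_row):
            raise ValueError("row length mismatch")
        for pred, target in zip(pred_row, sparse_row):
            if target > 0:  # supervise only where a sparse depth exists
                total += abs(pred - target)
                n_valid += 1
    # With no valid pixels there is nothing to supervise, so return 0.
    return total / n_valid if n_valid else 0.0


# Hypothetical 2x2 example: only the right-hand column has sparse depth.
pred = [[2.0, 3.0], [4.0, 5.0]]
sparse = [[0.0, 3.5], [0.0, 4.0]]
loss = sparse_depth_l1(pred, sparse)
```

Masking matters because a sparse map from 3D reconstruction covers only a small fraction of pixels. Averaging over all pixels would treat the missing entries as zero-depth targets and pull predictions toward zero.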