The Implicit Values of A Good Hand Shake: Handheld Multi-Frame Neural Depth Refinement
The method enables improved depth sensing for point-and-shoot smartphone photography, particularly for tabletop scenes. The contribution is incremental, since it builds on existing sensor data and optimization methods.
The paper addresses the problem of generating high-resolution depth maps for smartphone photography. It leverages the photographer's natural hand shake and the phone's existing sensors to produce high-fidelity depth estimates for close-range objects without extra hardware or user interaction.
Modern smartphones can continuously stream multi-megapixel RGB images at 60 Hz, synchronized with high-quality 3D pose information and low-resolution LiDAR-driven depth estimates. During a snapshot photograph, the natural unsteadiness of the photographer's hands offers millimeter-scale variation in camera pose, which we can capture along with RGB and depth in a circular buffer. In this work we explore how, from a bundle of these measurements acquired during viewfinding, we can combine dense micro-baseline parallax cues with kilopixel LiDAR depth to distill a high-fidelity depth map. We take a test-time optimization approach and train a coordinate MLP to output photometrically and geometrically consistent depth estimates at the continuous coordinates along the path traced by the photographer's natural hand shake. With no additional hardware, artificial hand motion, or user interaction beyond the press of a button, our proposed method brings high-resolution depth estimates to point-and-shoot "tabletop" photography: textured objects at close range.
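To see why millimeter-scale hand shake carries usable depth signal mainly at close range, the standard pinhole stereo relation gives a rough intuition. The sketch below is not the paper's method. It only evaluates disparity = f * b / z for a purely lateral camera shift, along with the first-order depth error caused by a small matching error. The focal length, baseline, and matching error are hypothetical values chosen for illustration, and the function names are invented here.

```python
def disparity_px(focal_px: float, baseline_m: float, depth_m: float) -> float:
    """Pixel shift of a point at depth_m between two views of a pinhole
    camera translated laterally by baseline_m."""
    if depth_m <= 0:
        raise ValueError("depth_m must be positive")
    return focal_px * baseline_m / depth_m


def depth_error_m(focal_px: float, baseline_m: float, depth_m: float,
                  match_error_px: float) -> float:
    """First-order depth uncertainty from a disparity matching error:
    |dz| ~= z^2 / (f * b) * |dd|."""
    return depth_m ** 2 / (focal_px * baseline_m) * match_error_px


# Hypothetical values, not taken from the paper.
FOCAL_PX = 3000.0       # assumed focal length in pixels
BASELINE_M = 0.005      # assumed 5 mm of hand shake
MATCH_ERROR_PX = 0.1    # assumed sub-pixel matching error

if __name__ == "__main__":
    for depth in (0.25, 0.5, 1.0, 5.0):
        d = disparity_px(FOCAL_PX, BASELINE_M, depth)
        e = depth_error_m(FOCAL_PX, BASELINE_M, depth, MATCH_ERROR_PX)
        print(f"depth {depth:5.2f} m: disparity {d:6.2f} px, "
              f"depth error ~{e * 1000:7.2f} mm")
```

Because the depth error grows with the square of the distance, the same small baseline that resolves a tabletop object gives far coarser estimates for distant scenes. This is consistent with the abstract's focus on textured objects at close range.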