Photorealistic Phantom Roads in Real Scenes: Disentangling 3D Hallucinations from Physical Geometry
The paper addresses a critical safety risk in depth estimation for applications such as autonomous driving, though its contribution is incremental because it builds on existing foundation models.
The paper tackles monocular depth estimation models that hallucinate 3D structures from geometrically planar but perceptually ambiguous inputs, a failure it terms the 3D Mirage. It introduces a benchmark, evaluation metrics, and a parameter-efficient mitigation strategy, Grounded Self-Distillation, which enforces planarity while preserving background knowledge.
Monocular depth foundation models achieve remarkable generalization by learning large-scale semantic priors, but this creates a critical vulnerability: they hallucinate illusory 3D structures from geometrically planar but perceptually ambiguous inputs. We term this failure the 3D Mirage. This paper introduces the first end-to-end framework to probe, quantify, and tame this unquantified safety risk. To probe, we present 3D-Mirage, the first benchmark of real-world illusions (e.g., street art) with precise planar-region annotations and context-restricted crops. To quantify, we propose a Laplacian-based evaluation framework with two metrics: the Deviation Composite Score (DCS) for spurious non-planarity and the Confusion Composite Score (CCS) for contextual instability. To tame this failure, we introduce Grounded Self-Distillation, a parameter-efficient strategy that surgically enforces planarity on illusion regions of interest (ROIs) while using a frozen teacher to preserve background knowledge, thus avoiding catastrophic forgetting. Our work provides the essential tools to diagnose and mitigate this phenomenon, urging a necessary shift in monocular depth estimation (MDE) evaluation from pixel-wise accuracy to structural and contextual robustness. Our code and benchmark will be publicly available to foster this exciting research direction.
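
The abstract does not give formulas for DCS, CCS, or the distillation loss, so the following is only a minimal sketch of the underlying idea, under stated assumptions. It relies on one fact: a map that is affine in pixel coordinates (as inverse depth is for a planar surface under a pinhole camera) has a discrete Laplacian of zero, so Laplacian energy inside an annotated planar ROI signals spurious structure. The function names `roi_laplacian_energy` and `grounded_loss`, the 5-point stencil, the L1 background term, and the `weight` parameter are all hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np


def laplacian(depth):
    """5-point discrete Laplacian, evaluated on interior pixels only."""
    d = np.asarray(depth, dtype=np.float64)
    return (d[:-2, 1:-1] + d[2:, 1:-1] + d[1:-1, :-2] + d[1:-1, 2:]
            - 4.0 * d[1:-1, 1:-1])


def roi_laplacian_energy(depth, roi_mask):
    """Mean absolute Laplacian inside the ROI (hypothetical planarity proxy).

    This is NOT the paper's DCS; it only illustrates why a Laplacian
    responds to non-planar structure and stays at zero on an affine map.
    """
    lap = laplacian(depth)
    mask = np.asarray(roi_mask, dtype=bool)[1:-1, 1:-1]  # match interior shape
    if not mask.any():
        return 0.0
    return float(np.mean(np.abs(lap[mask])))


def grounded_loss(student, teacher, roi_mask, weight=1.0):
    """Assumed two-term objective: planarity in the ROI, teacher agreement outside.

    The ROI term penalizes curvature in the student prediction; the
    background term keeps the student close to a frozen teacher prediction.
    """
    mask = np.asarray(roi_mask, dtype=bool)
    planarity = roi_laplacian_energy(student, mask)
    background = ~mask
    if background.any():
        diff = np.asarray(student, dtype=np.float64) - np.asarray(teacher, dtype=np.float64)
        preserve = float(np.mean(np.abs(diff[background])))
    else:
        preserve = 0.0
    return planarity + weight * preserve
```

In this sketch the ROI term alone would let the model drift everywhere else, which is why the background term ties the student to the frozen teacher outside the annotated region. How the paper actually balances or formulates the two terms is not stated in the abstract.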