CV
Nov 11, 2025

Top2Ground: A Height-Aware Dual Conditioning Diffusion Model for Robust Aerial-to-Ground View Generation

arXiv:2511.08258v1
h-index: 14
Originality: Incremental advance
AI Analysis

This paper addresses aerial-to-ground view generation for applications such as urban planning and navigation, but the advance is incremental: it builds on existing diffusion models and adds task-specific conditioning.

The paper tackles the problem of generating ground-level images from aerial views, which is challenging due to viewpoint disparity and occlusions, and introduces Top2Ground, a diffusion-based method that achieves a 7.3% average improvement in SSIM across three benchmark datasets.
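Since the headline result is reported in SSIM, a minimal sketch of the metric may help. This is a simplified, single-window (global) SSIM written with NumPy only. Standard SSIM averages the same statistic over local windows, and the paper's exact evaluation settings are not given here, so the function name `global_ssim` and its defaults (`data_range`, `k1`, `k2`) are illustrative assumptions rather than the authors' evaluation code.

```python
import numpy as np


def global_ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Simplified SSIM computed once over the whole image.

    Assumption: a single global window. Standard SSIM averages this
    statistic over many local (often Gaussian-weighted) windows.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)

    # Small constants that stabilise the division.
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2

    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()

    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


# Toy data standing in for a real and a generated ground-view image.
rng = np.random.default_rng(0)
reference = rng.random((64, 64))
generated = np.clip(reference + rng.normal(scale=0.2, size=(64, 64)), 0.0, 1.0)

score_same = global_ssim(reference, reference)
score_noisy = global_ssim(reference, generated)
```

An image compared with itself scores 1 (up to floating-point rounding), and the score drops as the generated image departs from the reference in luminance, contrast, or structure.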

Generating ground-level images from aerial views is a challenging task due to extreme viewpoint disparity, occlusions, and a limited field of view. We introduce Top2Ground, a novel diffusion-based method that directly generates photorealistic ground-view images from aerial input images without relying on intermediate representations such as depth maps or 3D voxels. Specifically, we condition the denoising process on a joint representation of VAE-encoded spatial features (derived from aerial RGB images and an estimated height map) and CLIP-based semantic embeddings. This design ensures the generation is both geometrically constrained by the scene's 3D structure and semantically consistent with its content. We evaluate Top2Ground on three diverse datasets: CVUSA, CVACT, and Auto Arborist. Our approach achieves a 7.3% average improvement in SSIM across the three benchmark datasets, showing that Top2Ground can robustly handle both wide and narrow fields of view and highlighting its strong generalization capabilities.
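The dual conditioning described in the abstract can be sketched at the level of tensor shapes. The abstract does not say how the spatial and semantic streams are fused, so the sketch below assumes a common latent-diffusion pattern: the VAE latents of the aerial image and height map are concatenated with the noisy latent along the channel axis, and the CLIP tokens enter through cross-attention. All function names, shapes, and the absence of learned projections are hypothetical simplifications, not the authors' architecture.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def build_spatial_input(noisy_latent, aerial_latent, height_latent):
    """Assumed spatial conditioning: channel-wise concatenation.

    Each input has shape (C, H, W) with matching H and W.
    """
    return np.concatenate([noisy_latent, aerial_latent, height_latent], axis=0)


def cross_attend(spatial_tokens, clip_tokens):
    """Assumed semantic conditioning: one cross-attention step.

    spatial_tokens: (N, d) queries from the spatial stream.
    clip_tokens:    (T, d) keys and values from the CLIP embedding.
    Learned query/key/value projections are omitted for brevity.
    """
    d = spatial_tokens.shape[-1]
    weights = softmax(spatial_tokens @ clip_tokens.T / np.sqrt(d))  # (N, T)
    return spatial_tokens + weights @ clip_tokens  # residual update


# Random stand-ins for real encoder outputs (hypothetical sizes).
rng = np.random.default_rng(0)
noisy = rng.normal(size=(4, 32, 32))    # noisy ground-view latent
aerial = rng.normal(size=(4, 32, 32))   # VAE latent of the aerial RGB image
height = rng.normal(size=(4, 32, 32))   # VAE latent of the estimated height map
clip_tokens = rng.normal(size=(4, 12))  # CLIP semantic tokens

spatial = build_spatial_input(noisy, aerial, height)  # (12, 32, 32)
tokens = spatial.reshape(spatial.shape[0], -1).T      # (1024, 12)
conditioned = cross_attend(tokens, clip_tokens)       # (1024, 12)
```

In this arrangement the spatial stream supplies per-location geometric cues, including height, while the CLIP tokens supply scene-level semantics. A real denoiser would add learned projections, timestep embeddings, and many stacked layers.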

