How Many Visual Levers Drive Urban Perception? Interventional Counterfactuals via Multiple Localised Edits
For researchers in urban perception and explainable AI, this work offers a structured way to move from correlational toward interventional scene-level explanations. The results are preliminary and have not yet been validated against human judgements.
The paper proposes a counterfactual framework for urban perception that identifies which localized visual edits would plausibly shift human safety judgements in street-view images. In a pilot across 50 scenes from five cities, Mobility Infrastructure and Physical Maintenance edits produced the largest proxy-based safety shifts; human validation remains pending.
Street-view perception models predict subjective attributes such as safety at scale, but remain correlational: they do not identify which localized visual changes would plausibly shift human judgement for a specific scene. We propose a lever-based interventional counterfactual framework that recasts scene-level explainability as a bounded search over structured counterfactual edits. Each lever specifies a semantic concept, spatial support, intervention direction, and constrained edit template. Candidate edits are generated through prompt-conditioned image editing and retained only if they satisfy validity checks for same-place preservation, locality, realism, and plausibility. In a pilot across 50 scenes from five cities, the framework reveals preliminary proxy-based directional patterns and a practical failure taxonomy under prompt-only editing, with Mobility Infrastructure and Physical Maintenance showing the largest auxiliary safety shifts. Human pairwise judgements remain the ground-truth endpoint for future validation.
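As an illustration of the lever structure and validity filter described above, the following is a minimal sketch, not the authors' implementation. The class names, field types, the "add"/"remove" reading of intervention direction, the per-check scores, and the 0.5 thresholds are all assumptions introduced here for illustration. Only the four lever fields and the four validity checks come from the abstract.

```python
from dataclasses import dataclass
from typing import Dict, List

# The four validity checks named in the abstract.
CHECKS = ("same_place", "locality", "realism", "plausibility")


@dataclass(frozen=True)
class Lever:
    """One structured counterfactual edit (field types are assumed)."""
    concept: str          # semantic concept, e.g. "Physical Maintenance"
    spatial_support: str  # region the edit is confined to (hypothetical label)
    direction: int        # intervention direction: +1 or -1 (assumed encoding)
    edit_template: str    # constrained prompt template with {verb} and {region}

    def prompt(self) -> str:
        # Assumption: direction maps to an "add" or "remove" verb in the prompt.
        verb = "add" if self.direction > 0 else "remove"
        return self.edit_template.format(verb=verb, region=self.spatial_support)


@dataclass
class Candidate:
    """An edited image, represented here only by its validity scores."""
    lever: Lever
    scores: Dict[str, float]  # hypothetical score in [0, 1] per check


def is_valid(candidate: Candidate, thresholds: Dict[str, float]) -> bool:
    # A candidate is retained only if it passes every check.
    # A missing score counts as a failure.
    return all(
        candidate.scores.get(name, 0.0) >= thresholds[name] for name in CHECKS
    )


def retain_valid(
    candidates: List[Candidate], thresholds: Dict[str, float]
) -> List[Candidate]:
    return [c for c in candidates if is_valid(c, thresholds)]


# Example with made-up scores and thresholds.
lever = Lever("Physical Maintenance", "sidewalk", -1, "{verb} litter on the {region}")
thresholds = {name: 0.5 for name in CHECKS}
good = Candidate(lever, {name: 0.9 for name in CHECKS})
leaky = Candidate(lever, {**good.scores, "locality": 0.2})  # edit spilled outside region
kept = retain_valid([good, leaky], thresholds)
```

Requiring every check to pass, rather than averaging the scores, reflects the abstract's wording that edits are retained only if they satisfy all four validity criteria. How each score would be computed is left open here.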