Leveraging Multimodal LLMs for Built Environment and Housing Attribute Assessment from Street-View Imagery
For urban planners and homeowners, the framework offers a scalable, low-cost method for nationwide building condition assessment, though it is an incremental application of existing LLM fine-tuning and distillation techniques.
The paper presents a framework that fine-tunes Gemma 3 27B and applies knowledge distillation to assess building conditions from street-view imagery. The fine-tuned model aligns strongly with human mean opinion scores (MOS) and outperforms individual raters on Spearman rank correlation (SRCC) and Pearson linear correlation (PLCC) measured against the MOS. The distilled models achieve comparable performance at lower cost: Gemma 3 4B with a 3x speedup, and EfficientNetV2-M and SwinV2-B with a 30x speedup.
We present a novel framework for automatically evaluating building conditions nationwide in the United States by leveraging large language models (LLMs) and Google Street View (GSV) imagery. By fine-tuning Gemma 3 27B on a modest human-labeled dataset, our approach achieves strong alignment with human mean opinion scores (MOS), outperforming even individual raters on Spearman rank correlation (SRCC) and Pearson linear correlation (PLCC) relative to the MOS benchmark. To enhance efficiency, we apply knowledge distillation, transferring the capabilities of Gemma 3 27B to a smaller Gemma 3 4B model that achieves comparable performance with a 3x speedup. We further distill the knowledge into a CNN-based model (EfficientNetV2-M) and a transformer (SwinV2-B), delivering close performance while achieving a 30x speed gain. In addition, we investigate LLMs' capabilities for assessing an extensive list of built environment and housing attributes through a human-AI alignment study, and we develop a visualization dashboard that integrates LLM assessment outcomes for downstream analysis by homeowners. Our framework offers a flexible and efficient solution for large-scale building condition assessment, enabling high accuracy with minimal human labeling effort.
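To make the evaluation metrics concrete, the following is a minimal sketch of how MOS, PLCC, and SRCC could be computed, using only the Python standard library. It is not the authors' evaluation code. The ratings, the model scores, the 1-5 scale, and all function names are hypothetical, and the sketch compares each rater against the full MOS (including that rater's own score), which may differ from the protocol used in the paper.

```python
from math import sqrt


def mean_opinion_scores(ratings):
    """ratings[r][i] is rater r's score for image i; returns the per-image mean."""
    n_raters = len(ratings)
    return [sum(column) / n_raters for column in zip(*ratings)]


def plcc(x, y):
    """Pearson linear correlation coefficient between two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return plcc(average_ranks(x), average_ranks(y))


# Hypothetical data: three raters scoring six images on an assumed 1-5 scale.
ratings = [
    [1, 2, 3, 4, 5, 3],
    [2, 2, 4, 4, 5, 2],
    [1, 3, 3, 5, 4, 3],
]
# Hypothetical model predictions for the same six images.
model_scores = [1.3, 2.4, 3.2, 4.4, 4.7, 2.6]

mos = mean_opinion_scores(ratings)
print("model  SRCC=%.3f PLCC=%.3f" % (srcc(model_scores, mos), plcc(model_scores, mos)))
for r, rater_scores in enumerate(ratings):
    print("rater%d SRCC=%.3f PLCC=%.3f" % (r, srcc(rater_scores, mos), plcc(rater_scores, mos)))
```

Under this setup, a model "outperforms individual raters" when its SRCC and PLCC against the MOS exceed those of each rater. SRCC rewards getting the ordering of buildings right, while PLCC also reflects how linearly the predicted scores track the MOS.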