AI · Sep 8, 2025

Can AI Make Energy Retrofit Decisions? An Evaluation of Large Language Models

arXiv:2509.06307v1 · 2 citations · h-index: 3 · Buildings
Originality: Synthesis-oriented
AI Analysis

This paper addresses the limited generalizability and interpretability of building energy retrofit decision methods used by practitioners, but its contribution is incremental: it assesses existing LLMs without proposing new methods.

The study evaluated seven large language models (LLMs) on residential energy retrofit decisions and found that, without fine-tuning, they generated effective recommendations, reaching up to a 54.5% top-1 match and 92.8% within the top 5, though performance varied by objective and context.

Conventional approaches to building energy retrofit decision-making suffer from limited generalizability and low interpretability, hindering adoption in diverse residential contexts. With the growth of Smart and Connected Communities, generative AI, especially large language models (LLMs), may help by processing contextual information and producing practitioner-readable recommendations. We evaluate seven LLMs (ChatGPT, DeepSeek, Gemini, Grok, Llama, and Claude) on residential retrofit decisions under two objectives: maximizing CO2 reduction (technical) and minimizing payback period (sociotechnical). Performance is assessed on four dimensions: accuracy, consistency, sensitivity, and reasoning, using a dataset of 400 homes across 49 US states. LLMs generate effective recommendations in many cases, reaching up to 54.5 percent top-1 match and 92.8 percent within the top 5 without fine-tuning. Performance is stronger for the technical objective, while sociotechnical decisions are limited by economic trade-offs and local context. Agreement across models is low, and higher-performing models tend to diverge from others. LLMs are sensitive to location and building geometry but less sensitive to technology and occupant behavior. Most models show step-by-step, engineering-style reasoning, but it is often simplified and lacks deeper contextual awareness. Overall, LLMs are promising assistants for energy retrofit decision-making, but improvements in accuracy, consistency, and context handling are needed for reliable practice.
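To make the accuracy figures concrete, here is a minimal sketch of how a top-k match rate could be computed. The abstract does not define the metric, so this assumes that a "top-k match" means the measure an LLM recommends for a home appears among the k best measures in a reference ranking for that home. The function name `top_k_match_rate`, the measure labels, and the toy data are all hypothetical and are not taken from the paper.

```python
from typing import Sequence


def top_k_match_rate(
    recommended: Sequence[str],
    ranked_truth: Sequence[Sequence[str]],
    k: int,
) -> float:
    """Fraction of homes whose recommended measure is among the k best-ranked ones.

    Assumed definition (not confirmed by the paper): ranked_truth[i] lists the
    retrofit measures for home i from best to worst under a given objective.
    """
    if len(recommended) != len(ranked_truth):
        raise ValueError("one recommendation and one ranking are needed per home")
    if not recommended:
        raise ValueError("at least one home is required")
    hits = sum(rec in ranking[:k] for rec, ranking in zip(recommended, ranked_truth))
    return hits / len(recommended)


# Hypothetical toy data: four homes, one recommended measure each.
recommended = ["heat_pump", "attic_insulation", "led_lighting", "air_sealing"]
ranked_truth = [
    ["heat_pump", "attic_insulation", "air_sealing"],
    ["heat_pump", "attic_insulation", "air_sealing"],
    ["heat_pump", "air_sealing", "attic_insulation"],
    ["attic_insulation", "air_sealing", "heat_pump"],
]

top1 = top_k_match_rate(recommended, ranked_truth, k=1)
top2 = top_k_match_rate(recommended, ranked_truth, k=2)
print(f"top-1 match: {top1:.1%}, within top 2: {top2:.1%}")
```

Under this definition the rate can only stay the same or rise as k grows, which is consistent with the reported gap between the top-1 and top-5 figures.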

