Shangjia Dong

LG
h-index: 2
7 papers
215 citations
Novelty: 41%
AI Score: 42

7 Papers

39.9 CY May 16
A Joint Synthetic Housing-Household Inventory

Xiao Qian, Shangjia Dong, Rachel Davidson

Accurately understanding the interactions between humans and the built environment requires integrated representations of both the buildings and the populations that occupy them. However, high-fidelity datasets that jointly capture detailed housing structures and demographic characteristics at the household level do not currently exist. This paper presents a framework for constructing a joint housing-household inventory that explicitly links individuals and households to compatible housing units from the National Structure Inventory (NSI), while preserving realistic population densities and demographic distributions. The framework integrates three components: (i) synthetic population generation from American Community Survey (ACS) Public Use Microdata Sample (PUMS) records that preserve complex intra-household relationships; (ii) a deep contrastive learning model that quantifies housing-household compatibility; and (iii) a hierarchical optimization-based allocation procedure that enforces building-level capacity and block-group-level demographic constraints. The generated synthetic population attains high statistical realism relative to the census microdata, and the contrastive learning model identifies compatible housing-household pairs with high predictive accuracy. Applied to coastal North Carolina, evaluations at building, neighborhood, and regional scales show that the joint inventory matches block-group-level demographic distributions, reproduces observed spatial population patterns without systematic bias, and maintains consistent allocation quality across urban, suburban, and rural contexts. By enabling coupled household- and building-level analyses, the resulting inventory supports a broad range of applications, including disaster resilience planning, housing and affordability analysis, energy-use assessment, and public health research.

31.5 LG Mar 31
PASM: Population Adaptive Symbolic Mixture-of-Experts Model for Cross-location Hurricane Evacuation Decision Prediction

Xiao Qian, Shangjia Dong

Accurate prediction of evacuation behavior is critical for disaster preparedness, yet models trained in one region often fail elsewhere. Using a multi-state hurricane evacuation survey, we show this failure goes beyond feature distribution shift: households with similar characteristics follow systematically different decision patterns across states. As a result, single global models overfit dominant responses, misrepresent vulnerable subpopulations, and generalize poorly across locations. We propose Population-Adaptive Symbolic Mixture-of-Experts (PASM), which pairs large language model-guided symbolic regression with a mixture-of-experts architecture. PASM discovers human-readable closed-form decision rules, specializes them to data-driven subpopulations, and routes each input to the appropriate expert at inference time. On Hurricanes Harvey and Irma data, transferring from Florida and Texas to Georgia with 100 calibration samples, PASM achieves a Matthews correlation coefficient (MCC) of 0.607, compared to XGBoost (0.404), TabPFN (0.333), GPT-5-mini (0.434), and meta-learning baselines MAML and Prototypical Networks (MCC ≤ 0.346). The routing mechanism assigns distinct formula archetypes to subpopulations, so the resulting behavioral profiles are directly interpretable. A fairness audit across four demographic axes finds no statistically significant disparities after Bonferroni correction. PASM closes more than half the cross-location generalization gap while keeping decision rules transparent enough for real-world emergency planning.
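For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a Matthews correlation coefficient can be computed for binary evacuate/stay predictions. The function name `mcc` and the toy labels are illustrative assumptions; this is not the authors' evaluation code.

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Toy labels (hypothetical): 1 = evacuated, 0 = stayed.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
score = mcc(y_true, y_pred)
```

Unlike plain accuracy, the MCC uses all four cells of the confusion matrix, which makes it less sensitive to class imbalance between evacuating and non-evacuating households.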

LG Feb 16, 2025
Deep Contrastive Learning for Feature Alignment: Insights from Housing-Household Relationship Inference

Xiao Qian, Shangjia Dong, Rachel Davidson

Housing and household characteristics are key determinants of social and economic well-being, yet our understanding of their interrelationships remains limited. This study addresses this knowledge gap by developing a deep contrastive learning (DCL) model to infer housing-household relationships using the American Community Survey (ACS) Public Use Microdata Sample (PUMS). More broadly, the proposed model is suitable for a class of problems where the goal is to learn joint relationships between two distinct entities without explicitly labeled ground truth data. Our proposed dual-encoder DCL approach leverages co-occurrence patterns in PUMS and introduces a bisect K-means clustering method to overcome the absence of ground truth labels. The dual-encoder DCL architecture is designed to handle the semantic differences between housing (building) and household (people) features while mitigating noise introduced by clustering. To validate the model, we generate a synthetic ground truth dataset and conduct comprehensive evaluations. The model further demonstrates its superior performance in capturing housing-household relationships in Delaware compared to state-of-the-art methods. A transferability test in North Carolina confirms its generalizability across diverse sociodemographic and geographic contexts. Finally, the post-hoc explainable AI analysis using SHAP values reveals that tenure status and mortgage information play a more significant role in housing-household matching than traditionally emphasized factors such as the number of persons and rooms.

LG Jun 30, 2024
A Deep Generative Framework for Joint Households and Individuals Population Synthesis

Xiao Qian, Utkarsh Gangwal, Shangjia Dong et al.

Household- and individual-level sociodemographic data are essential for understanding human-infrastructure interaction and for policymaking. However, the Public Use Microdata Sample (PUMS) offers only a sample at the state level, while census tract data provide only the marginal distributions of variables, without correlations. Therefore, we need an accurate synthetic population dataset that maintains the variable correlations observed in microdata, preserves household-individual and individual-individual relationships, adheres to state-level statistics, and accurately represents the geographic distribution of the population. We propose a deep generative framework leveraging the variational autoencoder (VAE) to generate a synthetic population with these features. The methodological contributions include (1) a new data structure for capturing household-individual and individual-individual relationships, (2) a transfer learning process with pre-training and fine-tuning steps to generate households and individuals whose aggregated distributions align with the census tract marginal distributions, and (3) a decoupled binary cross-entropy (D-BCE) loss function enabling distribution shift and out-of-sample record generation. Model results for an application in Delaware, USA demonstrate the ability to ensure the realism of generated household-individual records and to describe population statistics at the census tract level more accurately than existing methods. Furthermore, testing in North Carolina, USA yielded promising results, supporting the transferability of our method.

SOC-PH Aug 30, 2021
Predicting Road Flooding Risk with Machine Learning Approaches Using Crowdsourced Reports and Fine-grained Traffic Data

Faxi Yuan, William Mobley, Hamed Farahmand et al.

The objective of this study is to predict road flooding risks based on topographic, hydrologic, and temporal precipitation features using machine learning models. Predictive monitoring of road network flooding status plays an essential role in community hazard mitigation, preparedness, and response activities. Existing studies related to the estimation of road inundations either lack observed road inundation data for model validation or focus mainly on road inundation exposure assessment based on flood maps. This study addresses this limitation by using crowdsourced and fine-grained traffic data as an indicator of road inundation, and topographic, hydrologic, and temporal precipitation features as predictor variables. Two tree-based machine learning models (random forest and AdaBoost) were then trained and tested for predicting road inundations in the contexts of 2017 Hurricane Harvey and 2019 Tropical Storm Imelda in Harris County, Texas. The findings from Hurricane Harvey indicate that precipitation is the most important feature for predicting road inundation susceptibility, and that topographic features are more essential than hydrologic features for predicting road inundations in both storm cases. The random forest and AdaBoost models had relatively high AUC scores (0.860 and 0.810, respectively, for Harvey; 0.790 and 0.720, respectively, for Imelda), with the random forest model performing better in both cases. The random forest model showed stable performance for Harvey, but its performance varied significantly for Imelda. This study advances the emerging field of smart flood resilience in terms of predictive flood risk mapping at the road level. For example, such models could help impacted communities and emergency management agencies develop better preparedness and response strategies with improved situational awareness of road inundation likelihood as an extreme weather event unfolds.
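As a reminder of what the AUC scores above measure, here is a minimal sketch that computes the area under the ROC curve as the probability that a randomly chosen inundated road segment receives a higher score than a randomly chosen non-inundated one. The function name `roc_auc` and the toy scores are illustrative assumptions, not data or code from the study.

```python
def roc_auc(y_true, scores):
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # A positive ranked above a negative counts 1; a tie counts 0.5.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Toy example (hypothetical): 1 = road reported flooded, 0 = not flooded.
labels = [0, 0, 1, 1]
model_scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, model_scores)
```

The pairwise form is quadratic in the number of samples, so it suits small illustrations; a rank-based computation would be the usual choice for large datasets.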

AI Apr 20, 2021
Network-wide traffic signal control optimization using a multi-agent deep reinforcement learning

Zhenning Li, Hao Yu, Guohui Zhang et al.

Inefficient traffic control may cause numerous problems such as traffic congestion and energy waste. This paper proposes a novel multi-agent reinforcement learning method, named KS-DDPG (Knowledge Sharing Deep Deterministic Policy Gradient), to achieve optimal control by enhancing the cooperation between traffic signals. By introducing a knowledge-sharing-enabled communication protocol, each agent can access the collective representation of the traffic environment collected by all agents. The proposed method is evaluated through two experiments using synthetic and real-world datasets, respectively. The comparison with state-of-the-art reinforcement learning-based and conventional transportation methods demonstrates that the proposed KS-DDPG controls large-scale transportation networks efficiently and copes with fluctuations in traffic flow. In addition, the introduced communication mechanism is shown to speed up the convergence of the model without significantly increasing the computational burden.

SP Jun 15, 2020
A Hybrid Deep Learning Model for Predictive Flood Warning and Situation Awareness using Channel Network Sensors Data

Shangjia Dong, Tianbo Yu, Hamed Farahmand et al.

The objective of this study is to create and test a hybrid deep learning model, FastGRNN-FCN (Fast, Accurate, Stable and Tiny Gated Recurrent Neural Network-Fully Convolutional Network), for urban flood prediction and situation awareness using channel network sensor data. The study used Harris County, Texas as the testbed, and obtained channel sensor data from three historical flood events (the 2016 Tax Day Flood, the 2016 Memorial Day Flood, and the 2017 Hurricane Harvey flood) for training and validating the hybrid deep learning model. The flood data are divided into multivariate time series and used as the model input. Each input comprises nine variables, including information on the studied channel sensor and its predecessor and successor sensors in the channel network. The precision-recall curve and F-measure are used to identify the optimal set of model parameters. The optimal model, with a weight of 1 and a critical threshold of 0.59, is obtained through one hundred iterations that examine different weights and thresholds. The test accuracy and F-measure eventually reach 97.8% and 0.792, respectively. The model is then tested in predicting the 2019 Imelda flood in Houston, and the results show an excellent match with the empirical flood. The results also show that the model enables accurate prediction of spatial-temporal flood propagation and recession and provides emergency response officials with a predictive flood warning tool for prioritizing flood response and resource allocation strategies.
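The threshold selection described above can be illustrated with a minimal sketch that sweeps candidate thresholds and keeps the one with the highest F-measure. It assumes the "weight" corresponds to the beta parameter of an F-beta score, which the abstract does not confirm. The function names and toy data are hypothetical.

```python
def f_measure(y_true, scores, threshold, beta=1.0):
    """F-beta score after binarizing scores at the given threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def best_threshold(y_true, scores, candidates, beta=1.0):
    """Return the candidate threshold with the highest F-measure."""
    return max(candidates, key=lambda t: f_measure(y_true, scores, t, beta))

# Toy example (hypothetical): 1 = sensor flooded, 0 = not flooded.
labels = [0, 0, 1, 1]
flood_scores = [0.1, 0.4, 0.35, 0.8]
chosen = best_threshold(labels, flood_scores, [0.2, 0.5, 0.7])
```

With beta equal to 1 the score weights precision and recall equally; larger values favor recall, which may matter when missed flood warnings are costlier than false alarms.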