George Gui

AI
h-index: 32
Papers: 4
Citations: 87
Novelty: 41%
AI Score: 33

4 Papers

AI · Dec 24, 2023
The Challenge of Using LLMs to Simulate Human Behavior: A Causal Inference Perspective

George Gui, Olivier Toubia

Large Language Models (LLMs) have shown impressive potential to simulate human behavior. We identify a fundamental challenge in using them to simulate experiments: when LLM-simulated subjects are blind to the experimental design (as is standard practice with human subjects), variations in treatment systematically affect unspecified variables that should remain constant, violating the unconfoundedness assumption. Using demand estimation as a context and an actual experiment as a benchmark, we show this can lead to implausible results. While confounding may in principle be addressed by controlling for covariates, this can compromise ecological validity in the context of LLM simulations: controlled covariates become artificially salient in the simulated decision process, which introduces focalism. This trade-off between unconfoundedness and ecological validity is usually absent in traditional experimental design and represents a unique challenge in LLM simulations. We formalize this challenge theoretically, showing it stems from ambiguous prompting strategies, and hence cannot be fully addressed by improving training data or by fine-tuning. Alternative approaches that unblind the experimental design to the LLM show promise. Our findings suggest that effectively leveraging LLMs for experimental simulations requires fundamentally rethinking established experimental design practices rather than simply adapting protocols developed for human subjects.

CY · Sep 23, 2025
A Mega-Study of Digital Twins Reveals Strengths, Weaknesses and Opportunities for Further Improvement

Tianyi Peng, George Gui, Daniel J. Merlau et al.

Digital representations of individuals ("digital twins") promise to transform social science and decision-making. Yet it remains unclear whether such twins truly mirror the people they emulate. We conducted 19 preregistered studies with a representative U.S. panel and their digital twins, each constructed from rich individual-level data, enabling direct comparisons between human and twin behavior across a wide range of domains and stimuli (including never-seen-before ones). Twins reproduced individual responses with 75% accuracy but a seemingly low correlation with human answers (approximately 0.2). However, this apparently high accuracy was no higher than that achieved by generic personas based on demographics only. In contrast, correlation improved when twins incorporated detailed personal information, even outperforming traditional machine learning benchmarks that require additional data. Twins exhibited systematic strengths and weaknesses, performing better in social and personality domains but worse in political ones, and were more accurate for participants with higher education, higher income, and moderate political views and religious attendance. Together, these findings delineate both the promise and the current limits of digital twins: they capture some relative differences among individuals but not yet the unique judgments of specific people. All data and code are publicly available to support the further development and evaluation of digital twin pipelines.

CL · Dec 13, 2024
Modeling Story Expectations to Understand Engagement: A Generative Framework Using LLMs

Hortense Fong, George Gui

Understanding when and why consumers engage with stories is crucial for content creators and platforms. While existing theories suggest that audience beliefs about what is going to happen should play an important role in engagement decisions, empirical work has mostly focused on developing techniques to directly extract features from actual content, rather than capturing forward-looking beliefs, due to the lack of a principled way to model such beliefs in unstructured narrative data. To complement existing feature extraction techniques, this paper introduces a novel framework that leverages large language models to model audience forward-looking beliefs about how stories might unfold. Our method generates multiple potential continuations for each story and extracts features related to expectations, uncertainty, and surprise using established content analysis techniques. Applying our method to over 30,000 book chapters, we demonstrate that our framework complements existing feature engineering techniques by amplifying their marginal explanatory power on average by 31%. The results reveal that different types of engagement (continuing to read, commenting, and voting) are driven by distinct combinations of current and anticipated content features. Our framework provides a novel way to study and explore how audience forward-looking beliefs shape their engagement with narrative media, with implications for marketing strategy in content-focused industries.

CR · Jun 14, 2018
A Memo on the Proof-of-Stake Mechanism

George Gui, Ali Hortacsu, Jose Tudon

We analyze the economic incentives generated by the proof-of-stake mechanism discussed in the Ethereum Casper upgrade proposal. Compared with proof-of-work, proof-of-stake has a different cost structure for attackers. In Budish (2018), three equations characterize the limits of Bitcoin, which has a proof-of-work mechanism. We investigate their counterparts and evaluate the risk of double-spending and sabotage attacks. We argue that PoS is safer than PoW against a double-spending attack because of the tractability of attackers, which implies a large "stock" cost for the attacker. Compared to a PoW system whose mining equipment is repurposable, PoS is also safer against a sabotage attack.