Olivier Toubia

CY
h-index32
5papers
112citations
Novelty34%
AI Score37

5 Papers

CYFeb 23
Examining and Addressing Barriers to Diversity in LLM-Generated Ideas

Yuting Deng, Melanie Brucks, Olivier Toubia

Ideas generated by independent samples of humans tend to be more diverse than ideas generated from independent LLM samples, raising concerns that widespread reliance on LLMs could homogenize ideation and undermine innovation at a societal level. Drawing on cognitive psychology, we identify (both theoretically and empirically) two mechanisms undermining LLM idea diversity. First, at the individual level, LLMs exhibit fixation just as humans do, where early outputs constrain subsequent ideation. Second, at the collective level, LLMs aggregate knowledge into a unified distribution rather than exhibiting the knowledge partitioning inherent to human populations, where each person occupies a distinct region of the knowledge space. Through four studies, we demonstrate that targeted prompting interventions can address each mechanism independently: Chain-of-Thought (CoT) prompting reduces fixation by encouraging structured reasoning (only in LLMs, not humans), while ordinary personas (versus "creative entrepreneurs" such as Steve Jobs) improve knowledge partitioning by serving as diverse sampling cues, anchoring generation in distinct regions of the semantic space. Combining both approaches produces the highest idea diversity, outperforming humans. These findings offer a theoretically grounded framework for understanding LLM idea diversity and practical strategies for human-AI collaborations that leverage AI's efficiency without compromising the diversity essential to a healthy innovation ecosystem.

AIDec 24, 2023
The Challenge of Using LLMs to Simulate Human Behavior: A Causal Inference Perspective

George Gui, Olivier Toubia

Large Language Models (LLMs) have shown impressive potential to simulate human behavior. We identify a fundamental challenge in using them to simulate experiments: when LLM-simulated subjects are blind to the experimental design (as is standard practice with human subjects), variations in treatment systematically affect unspecified variables that should remain constant, violating the unconfoundedness assumption. Using demand estimation as a context and an actual experiment as a benchmark, we show this can lead to implausible results. While confounding may in principle be addressed by controlling for covariates, this can compromise ecological validity in the context of LLM simulations: controlled covariates become artificially salient in the simulated decision process, which introduces focalism. This trade-off between unconfoundedness and ecological validity is usually absent in traditional experimental design and represents a unique challenge in LLM simulations. We formalize this challenge theoretically, showing it stems from ambiguous prompting strategies, and hence cannot be fully addressed by improving training data or by fine-tuning. Alternative approaches that unblind the experimental design to the LLM show promise. Our findings suggest that effectively leveraging LLMs for experimental simulations requires fundamentally rethinking established experimental design practices rather than simply adapting protocols developed for human subjects.

CYMay 23, 2025
Twin-2K-500: A dataset for building digital twins of over 2,000 people based on their answers to over 500 questions

Olivier Toubia, George Z. Gui, Tianyi Peng et al.

LLM-based digital twin simulation, where large language models are used to emulate individual human behavior, holds great promise for research in AI, social science, and digital experimentation. However, progress in this area has been hindered by the scarcity of real, individual-level datasets that are both large and publicly available. This lack of high-quality ground truth limits both the development and validation of digital twin methodologies. To address this gap, we introduce a large-scale, public dataset designed to capture a rich and holistic view of individual human behavior. We survey a representative sample of $N = 2,058$ participants (average 2.42 hours per person) in the US across four waves with 500 questions in total, covering a comprehensive battery of demographic, psychological, economic, personality, and cognitive measures, as well as replications of behavioral economics experiments and a pricing survey. The final wave repeats tasks from earlier waves to establish a test-retest accuracy baseline. Initial analyses suggest the data are of high quality and show promise for constructing digital twins that predict human behavior well at the individual and aggregate levels. By making the full dataset publicly available, we aim to establish a valuable testbed for the development and benchmarking of LLM-based persona simulations. Beyond LLM applications, due to its unique breadth and scale the dataset also enables broad social science research, including studies of cross-construct correlations and heterogeneous treatment effects.

CYSep 23, 2025
A Mega-Study of Digital Twins Reveals Strengths, Weaknesses and Opportunities for Further Improvement

Tianyi Peng, George Gui, Daniel J. Merlau et al.

Digital representations of individuals ("digital twins") promise to transform social science and decision-making. Yet it remains unclear whether such twins truly mirror the people they emulate. We conducted 19 preregistered studies with a representative U.S. panel and their digital twins, each constructed from rich individual-level data, enabling direct comparisons between human and twin behavior across a wide range of domains and stimuli (including never-seen-before ones). Twins reproduced individual responses with 75% accuracy and seemingly low correlation with human answers (approximately 0.2). However, this apparently high accuracy was no higher than that achieved by generic personas based on demographics only. In contrast, correlation improved when twins incorporated detailed personal information, even outperforming traditional machine learning benchmarks that require additional data. Twins exhibited systematic strengths and weaknesses - performing better in social and personality domains, but worse in political ones - and were more accurate for participants with higher education, higher income, and moderate political views and religious attendance. Together, these findings delineate both the promise and the current limits of digital twins: they capture some relative differences among individuals but not yet the unique judgments of specific people. All data and code are publicly available to support the further development and evaluation of digital twin pipelines.

HCJul 6, 2021
Understanding Consumer Preferences for Explanations Generated by XAI Algorithms

Yanou Ramon, Tom Vermeire, Olivier Toubia et al.

Explaining firm decisions made by algorithms in customer-facing applications is increasingly required by regulators and expected by customers. While the emerging field of Explainable Artificial Intelligence (XAI) has mainly focused on developing algorithms that generate such explanations, there has not yet been sufficient consideration of customers' preferences for various types and formats of explanations. We discuss theoretically and study empirically people's preferences for explanations of algorithmic decisions. We focus on three main attributes that describe automatically-generated explanations from existing XAI algorithms (format, complexity, and specificity), and capture differences across contexts (online targeted advertising vs. loan applications) as well as heterogeneity in users' cognitive styles. Despite their popularity among academics, we find that counterfactual explanations are not popular among users, unless they follow a negative outcome (e.g., loan application was denied). We also find that users are willing to tolerate some complexity in explanations. Finally, our results suggest that preferences for specific (vs. more abstract) explanations are related to the level at which the decision is construed by the user, and to the deliberateness of the user's cognitive style.