Gordon Burtch

h-index 33

3 papers

6,313 citations

3 Papers

11.6 CY Jul 12

Gordon Burtch

Large language models (LLMs) often produce homogeneous outputs, raising concerns that AI coding assistants may lead to convergence in the software artifacts that developers create. Whether this occurs in practice is unclear because developers interactively prompt, evaluate, modify, and reject model outputs, and because outputs vary with prompt and repository context. I examine code homogenization using Kaggle contest submissions from 2019 to mid-2026. I first document widespread convergence toward the random seed value 42, consistent with LLMs reinforcing a longstanding convention in programming culture. I then study homogenization more broadly, at two levels of aggregation and two levels of abstraction. At the submission level, I measure the average pairwise similarity of submissions within contests. At the contest level, I measure the conceptual span of submitted code. The two levels of abstraction motivate distinct representations: TF-IDF representations, which capture surface syntax, and Voyage 3 code embeddings, which capture code intent and semantics. The results demonstrate substantial syntactic homogenization at both the individual and collective levels: individual submissions have become more alike in literal syntax and code structure, while the latent dimensionality of syntactic variation has narrowed. In contrast, I find little evidence of semantic homogenization, either individually or collectively. Average semantic distance remains essentially flat, and the contest-level latent dimensional span of semantic approaches remains stable, with some evidence that it has even expanded modestly. These findings suggest that AI coding assistants are standardizing implementation details, but that there is so far no evidence of homogenization in the approaches and problem-solving strategies coders employ.
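The submission-level measure described above (average pairwise similarity under a TF-IDF representation) might be sketched roughly as follows. This is only an illustrative sketch, not the paper's pipeline: the tokenizer, the smoothed IDF formula, the use of cosine similarity, and the function names `tokenize`, `tfidf_vectors`, `cosine`, and `mean_pairwise_similarity` are all assumptions made for the example.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(code):
    """Split source text into identifier, number, and symbol tokens (an assumed scheme)."""
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict of token -> weight) per document."""
    token_lists = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for toks in token_lists for tok in set(toks))
    # Smoothed IDF, so tokens present in every document keep a nonzero weight.
    idf = {tok: math.log((1 + n) / (1 + count)) + 1 for tok, count in df.items()}
    vectors = []
    for toks in token_lists:
        tf = Counter(toks)
        vectors.append({tok: freq * idf[tok] for tok, freq in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(docs):
    """Average cosine similarity over all unordered pairs of documents."""
    vectors = tfidf_vectors(docs)
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two documents")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```

Under these assumptions, calling `mean_pairwise_similarity` on the submissions of one contest yields a single homogeneity score for that contest; a rising score over time would correspond to the syntactic convergence the abstract reports.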

16.1 GN Oct 25, 2024

Take Caution in Using LLMs as Human Surrogates: Scylla Ex Machina

Yuan Gao, Dokyun Lee, Gordon Burtch et al.

Recent studies suggest large language models (LLMs) can exhibit human-like reasoning, aligning with human behavior in economic experiments, surveys, and political discourse. This has led many to propose that LLMs can be used as surrogates or simulations for humans in social science research. However, LLMs differ fundamentally from humans, relying on probabilistic patterns, absent the embodied experiences or survival objectives that shape human cognition. We assess the reasoning depth of LLMs using the 11-20 money request game. Nearly all advanced approaches fail to replicate human behavior distributions across many models. Causes of failure are diverse and unpredictable, relating to input language, roles, and safeguarding. These results advise caution when using LLMs to study human behavior or as surrogates or simulations.
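For readers unfamiliar with the 11-20 money request game, the following sketch illustrates why it is used to gauge reasoning depth. The rules encoded here follow the standard design of the game as commonly described (each player requests an amount from 11 to 20, receives it, and earns a bonus of 20 for requesting exactly one less than the opponent); the abstract does not spell them out, so treat the payoff details, the level-0 anchor of 20, and the function names as assumptions.

```python
LOW, HIGH, BONUS = 11, 20, 20  # assumed parameters of the standard game


def payoff(own, other):
    """Payoff to a player requesting `own` against an opponent requesting `other`."""
    if not (LOW <= own <= HIGH and LOW <= other <= HIGH):
        raise ValueError("requests must lie between 11 and 20")
    # The bonus is paid only for undercutting the opponent by exactly one.
    return own + (BONUS if own == other - 1 else 0)


def best_response(other):
    """Request that maximizes payoff against a known opponent request."""
    return max(range(LOW, HIGH + 1), key=lambda own: payoff(own, other))


def level_k_request(k):
    """Request of a level-k reasoner, assuming level 0 naively asks for 20."""
    request = HIGH
    for _ in range(k):
        request = best_response(request)
    return request
```

In this sketch each additional step of reasoning lowers the request by one (20, 19, 18, and so on), so the distribution of observed requests can be read as a rough distribution of reasoning depths, which is what makes the game a convenient benchmark for comparing LLM and human behavior.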

1.2 EM Dec 19, 2020

Achieving Reliable Causal Inference with Data-Mined Variables: A Random Forest Approach to the Measurement Error Problem

Mochen Yang, Edward McFowland, Gordon Burtch et al.

Combining machine learning with econometric analysis is becoming increasingly prevalent in both research and practice. A common empirical strategy involves the application of predictive modeling techniques to 'mine' variables of interest from available data, followed by the inclusion of those variables into an econometric framework, with the objective of estimating causal effects. Recent work highlights that, because the predictions from machine learning models are inevitably imperfect, econometric analyses based on the predicted variables are likely to suffer from bias due to measurement error. We propose a novel approach to mitigate these biases, leveraging the ensemble learning technique known as the random forest. We propose employing random forest not just for prediction, but also for generating instrumental variables to address the measurement error embedded in the prediction. The random forest algorithm performs best when comprised of a set of trees that are individually accurate in their predictions, yet which also make 'different' mistakes, i.e., have weakly correlated prediction errors. A key observation is that these properties are closely related to the relevance and exclusion requirements of valid instrumental variables. We design a data-driven procedure to select tuples of individual trees from a random forest, in which one tree serves as the endogenous covariate and the other trees serve as its instruments. Simulation experiments demonstrate the efficacy of the proposed approach in mitigating estimation biases and its superior performance over three alternative methods for bias correction.
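The core intuition (one noisy prediction serving as an instrument for another) can be illustrated with a small simulation. This is a deliberately simplified sketch, not the authors' procedure: instead of fitting actual trees, it stands in for two trees' predictions with the true variable plus independent additive noise, it uses a single instrument, and it omits the data-driven tree selection step entirely. All names and parameter values are hypothetical.

```python
import random


def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def ols_slope(y, x):
    """Slope from a simple regression of y on x."""
    return covariance(x, y) / covariance(x, x)


def iv_slope(y, x, z):
    """Single-instrument IV estimator: cov(z, y) / cov(z, x)."""
    return covariance(z, y) / covariance(z, x)


def simulate(n=20000, beta=2.0, seed=0):
    """Compare naive OLS and IV when the regressor is a noisy prediction."""
    rng = random.Random(seed)
    truth = [rng.gauss(0, 1) for _ in range(n)]        # unobserved true variable
    y = [beta * t + rng.gauss(0, 1) for t in truth]    # outcome
    # Assumption: each "tree" predicts the truth with its own independent error.
    pred_a = [t + rng.gauss(0, 1) for t in truth]      # endogenous covariate
    pred_b = [t + rng.gauss(0, 1) for t in truth]      # instrument
    return ols_slope(y, pred_a), iv_slope(y, pred_a, pred_b)
```

Under these assumptions, `simulate()` returns an OLS estimate attenuated toward zero (roughly half the true coefficient, since the noise variance equals the signal variance) and an IV estimate close to the true `beta`. Real tree predictions need not have independent, classical errors, which is why the abstract describes a selection procedure for choosing suitable tuples of trees.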