AI EM ML Nov 30, 2025

Foundation Priors

arXiv:2512.01107v1 7.8 h-index: 2

Originality: Incremental advance

AI Analysis

The paper gives researchers and decision-makers a principled framework for using generative AI outputs while accounting for subjectivity and bias, though its contribution is incremental: it refines existing statistical methods rather than introducing new ones.

The paper tackles the problem of using synthetic outputs from foundation models as data in empirical research by introducing the concept of a foundation prior, which treats these outputs as draws from a prior predictive distribution rather than real observations, and shows how to incorporate them into statistical workflows to avoid conflating synthetic data with real facts.

Foundation models, and in particular large language models, can generate highly informative responses, prompting growing interest in using these "synthetic" outputs as data in empirical research and decision-making. This paper introduces the idea of a foundation prior, under which model-generated outputs are treated not as real observations but as draws from the prior predictive distribution induced by the foundation prior. As such, synthetic data reflect both the model's learned patterns and the user's subjective priors, expectations, and biases. We model the subjectivity of the generative process by making explicit the dependence of synthetic outputs on the user's anticipated data distribution, the prompt-engineering process, and the trust placed in the foundation model. We derive the foundation prior as an exponentially tilted, generalized Bayesian update of the user's primitive prior, where a trust parameter governs the weight assigned to synthetic data. We then show how synthetic data and the associated foundation prior can be incorporated into standard statistical and econometric workflows, and discuss their use in applications such as refining complex models, informing latent constructs, guiding experimental design, and augmenting random-coefficient and partially linear specifications. By treating generative outputs as structured, explicitly subjective priors rather than as empirical observations, the framework offers a principled way to harness foundation models in empirical work while avoiding the conflation of synthetic "facts" with real data.
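To make the role of the trust parameter concrete, here is a minimal sketch of a tempered (generalized Bayesian) update in a Beta-Bernoulli model. It is not the paper's derivation: the conjugate model, the names `Beta`, `foundation_prior`, and `posterior`, and the restriction of `trust` to non-negative values are all assumptions made for illustration. The synthetic-data likelihood is raised to the power `trust` before it tilts the primitive prior, so `trust = 0` ignores the synthetic outputs and `trust = 1` counts them like real observations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Beta:
    """Beta(a, b) distribution over a Bernoulli success probability."""
    a: float
    b: float

    @property
    def mean(self) -> float:
        return self.a / (self.a + self.b)


def foundation_prior(primitive: Beta, synth_successes: int,
                     synth_n: int, trust: float) -> Beta:
    """Tilt the primitive prior by the synthetic likelihood raised to `trust`.

    Illustrative assumption: a tempered-likelihood update, which in this
    conjugate model simply down-weights the synthetic counts by `trust`.
    """
    if trust < 0:
        raise ValueError("trust must be non-negative")
    if not 0 <= synth_successes <= synth_n:
        raise ValueError("synth_successes must lie between 0 and synth_n")
    synth_failures = synth_n - synth_successes
    return Beta(primitive.a + trust * synth_successes,
                primitive.b + trust * synth_failures)


def posterior(prior: Beta, real_successes: int, real_n: int) -> Beta:
    """Ordinary Bayesian update of a Beta prior with real Bernoulli data."""
    if not 0 <= real_successes <= real_n:
        raise ValueError("real_successes must lie between 0 and real_n")
    return Beta(prior.a + real_successes,
                prior.b + (real_n - real_successes))


# Hypothetical numbers: a flat primitive prior, 100 synthetic responses
# (80 "successes") trusted at weight 0.1, then 10 real observations.
primitive = Beta(1.0, 1.0)
fp = foundation_prior(primitive, synth_successes=80, synth_n=100, trust=0.1)
post = posterior(fp, real_successes=3, real_n=10)
print(fp.mean, post.mean)
```

In this sketch, 100 synthetic responses at trust 0.1 carry the weight of roughly 10 real observations, so the real data can still move the posterior substantially. Real data always enter at full weight; only the synthetic outputs are discounted.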
