The Collapse of Heterogeneity in Silicon Philosophers

arXiv:2604.23575

AI Analysis

For AI alignment and evaluation, the paper reveals a critical failure mode of using LLMs as human proxies in domains that require diverse perspectives.

The paper shows that large language models used as substitutes for human philosophers systematically collapse heterogeneity: they over-correlate philosophical judgments and produce artificial consensus across domains. The comparison uses data from 277 professional philosophers and is validated against the full PhilPapers 2020 Survey (1785 respondents).

Silicon samples are increasingly used as a low-cost substitute for human panels and have been shown to reproduce aggregate human opinion with high fidelity. We show that, in the alignment-relevant domain of philosophy, silicon samples systematically collapse heterogeneity. Using data from $N = 277$ professional philosophers drawn from PhilPeople profiles, we evaluate seven proprietary and open-source large language models on their ability to replicate individual philosophical positions and to preserve cross-question correlation structures across philosophical domains. We find that language models substantially over-correlate philosophical judgments, producing artificial consensus across domains. This collapse is associated in part with specialist effects, whereby models implicitly assume that domain specialists hold highly similar philosophical views. We assess the robustness of these findings by studying the impact of DPO fine-tuning and by validating results against the full PhilPapers 2020 Survey ($N = 1785$). We conclude by discussing implications for alignment, evaluation, and the use of silicon samples as substitutes for human judgment. The code of this project can be found at https://github.com/stanford-del/silicon-philosophers.
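To make the idea of over-correlated judgments concrete, the following is a minimal sketch on synthetic data, not the authors' method or code. It assumes responses are coded numerically as a respondents-by-questions table and uses the mean absolute Pearson correlation between question pairs as one possible summary of cross-question correlation structure. The function names, the panel size of 8 questions, and the single-latent-factor "silicon" panel are all hypothetical choices made for illustration.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mean_abs_cross_question_corr(responses):
    """Average |correlation| over all pairs of questions.

    `responses` is a list of rows, one row per respondent and one column
    per question. Higher values mean answers to different questions move
    together more strongly, i.e. less heterogeneity across respondents.
    """
    columns = list(zip(*responses))
    pairs = list(combinations(columns, 2))
    return sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs)


rng = random.Random(0)
N_RESPONDENTS, N_QUESTIONS = 277, 8  # 8 questions is an arbitrary choice

# Hypothetical "human" panel: answers to each question vary independently.
human = [[rng.gauss(0, 1) for _ in range(N_QUESTIONS)]
         for _ in range(N_RESPONDENTS)]

# Hypothetical "silicon" panel: one shared latent stance drives every
# answer, with only a little question-specific noise on top.
silicon = []
for _ in range(N_RESPONDENTS):
    latent = rng.gauss(0, 1)
    silicon.append([latent + 0.3 * rng.gauss(0, 1)
                    for _ in range(N_QUESTIONS)])

human_score = mean_abs_cross_question_corr(human)
silicon_score = mean_abs_cross_question_corr(silicon)
print(f"human panel:   {human_score:.3f}")
print(f"silicon panel: {silicon_score:.3f}")
```

In this toy setup the silicon panel scores far higher than the human panel, which is the qualitative pattern the abstract describes as artificial consensus. A real comparison would use survey responses and model outputs on the same questions, and would need to handle categorical answer options and missing responses, which this sketch ignores.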
