AIFeb 24
PreScience: A Benchmark for Forecasting Scientific ContributionsAnirudh Ajith, Amanpreet Singh, Jay DeYoung et al.
Can AI systems trained on the scientific record up to a fixed point in time forecast the scientific advances that follow? Such a capability could help researchers identify collaborators and impactful research directions, and anticipate which problems and methods will become central next. We introduce PreScience -- a scientific forecasting benchmark that decomposes the research process into four interdependent generative tasks: collaborator prediction, prior work selection, contribution generation, and impact prediction. PreScience is a carefully curated dataset of 98K recent AI-related research papers, featuring disambiguated author identities, temporally aligned scholarly metadata, and a structured graph of companion author publication histories and citations spanning 502K total papers. We develop baselines and evaluations for each task, including LACERScore, a novel LLM-based measure of contribution similarity that outperforms previous metrics and approximates inter-annotator agreement. We find substantial headroom remains in each task -- e.g. in contribution generation, frontier LLMs achieve only moderate similarity to the ground-truth (GPT-5, averages 5.6 on a 1-10 scale). When composed into a 12-month end-to-end simulation of scientific production, the resulting synthetic corpus is systematically less diverse and less novel than human-authored research from the same period.
HCApr 3, 2025
LLM Social Simulations Are a Promising Research MethodJacy Reese Anthis, Ryan Liu, Sean M. Richardson et al.
Accurate and verifiable large language model (LLM) simulations of human research subjects promise an accessible data source for understanding human behavior and training new AI systems. However, results to date have been limited, and few social scientists have adopted this method. In this position paper, we argue that the promise of LLM social simulations can be achieved by addressing five tractable challenges. We ground our argument in a review of empirical comparisons between LLMs and human research subjects, commentaries on the topic, and related work. We identify promising directions, including context-rich prompting and fine-tuning with social science datasets. We believe that LLM social simulations can already be used for pilot and exploratory studies, and more widespread use may soon be possible with rapidly advancing LLM capabilities. Researchers should prioritize developing conceptual models and iterative evaluations to make the best use of new AI systems.
54.0CLApr 29
Semantic Structure of Feature Space in Large Language ModelsAustin C. Kozlowski, Andrei Boutyline
We show that the geometric relations between semantic features in large language models' hidden states closely mirror human psychological associations. We construct feature vectors corresponding to 360 words and project them on 32 semantic axes (e.g. beautiful-ugly, soft-hard), and find that these projections correlate highly with human ratings of those words on the respective semantic scales. Second, we find that the cosine similarities between the semantic axes themselves are highly predictive of the correlations between these scales in the survey. Third, we show that substantial variance across the 32 semantic axes lies on a low-dimensional subspace, reproducing patterns typical of human semantic associations. Finally, we demonstrate that steering a word on one semantic axis causes spillover effects on the model's rating of that word on other semantic scales proportionate to the cosine similarity between those semantic axes. These findings suggest that features should be understood not only in isolation but through their geometric relations and the meaningful subspaces they form.
CYMay 23, 2024
In Silico Sociology: Forecasting COVID-19 Polarization with Large Language ModelsAustin C. Kozlowski, Hyunku Kwon, James A. Evans
By training deep neural networks on massive archives of digitized text, large language models (LLMs) learn the complex linguistic patterns that constitute historic and contemporary discourses. We argue that LLMs can serve as a valuable tool for sociological inquiry by enabling accurate simulation of respondents from specific social and cultural contexts. Applying LLMs in this capacity, we reconstruct the public opinion landscape of 2019 to examine the extent to which the future polarization over COVID-19 was prefigured in existing political discourse. Using an LLM trained on texts published through 2019, we simulate the responses of American liberals and conservatives to a battery of pandemic-related questions. We find that the simulated respondents reproduce observed partisan differences in COVID-19 attitudes in 84% of cases, significantly greater than chance. Prompting the simulated respondents to justify their responses, we find that much of the observed partisan gap corresponds to differing appeals to freedom, safety, and institutional trust. Our findings suggest that the politicization of COVID-19 was largely consistent with the prior ideological landscape, and this unprecedented event served to advance history along its track rather than change the rails.
CLAug 4, 2025
Semantic Structure in Large Language Model EmbeddingsAustin C. Kozlowski, Callin Dai, Andrei Boutyline
Psychological research consistently finds that human ratings of words across diverse semantic scales can be reduced to a low-dimensional form with relatively little information loss. We find that the semantic associations encoded in the embedding matrices of large language models (LLMs) exhibit a similar structure. We show that the projections of words on semantic directions defined by antonym pairs (e.g. kind - cruel) correlate highly with human ratings, and further find that these projections effectively reduce to a 3-dimensional subspace within LLM embeddings, closely resembling the patterns derived from human survey responses. Moreover, we find that shifting tokens along one semantic direction causes off-target effects on geometrically aligned features proportional to their cosine similarity. These findings suggest that semantic features are entangled within LLMs similarly to how they are interconnected in human language, and a great deal of semantic information, despite its apparent complexity, is surprisingly low-dimensional. Furthermore, accounting for this semantic structure may prove essential for avoiding unintended consequences when steering features.
CLMar 25, 2018
The Geometry of Culture: Analyzing Meaning through Word EmbeddingsAustin C. Kozlowski, Matt Taddy, James A. Evans
We demonstrate the utility of a new methodological tool, neural-network word embedding models, for large-scale text analysis, revealing how these models produce richer insights into cultural associations and categories than possible with prior methods. Word embeddings represent semantic relations between words as geometric relationships between vectors in a high-dimensional space, operationalizing a relational model of meaning consistent with contemporary theories of identity and culture. We show that dimensions induced by word differences (e.g. man - woman, rich - poor, black - white, liberal - conservative) in these vector spaces closely correspond to dimensions of cultural meaning, and the projection of words onto these dimensions reflects widely shared cultural connotations when compared to surveyed responses and labeled historical data. We pilot a method for testing the stability of these associations, then demonstrate applications of word embeddings for macro-cultural investigation with a longitudinal analysis of the coevolution of gender and class associations in the United States over the 20th century and a comparative analysis of historic distinctions between markers of gender and class in the U.S. and Britain. We argue that the success of these high-dimensional models motivates a move towards "high-dimensional theorizing" of meanings, identities and cultural processes.