James Evans

AI
h-index78
44papers
877citations
Novelty46%
AI Score56

44 Papers

HCMay 29
AI Behavioral Science

Matthew O. Jackson, Qiaozhu Me, Stephanie W. Wang et al.

We outline a foundation for a new field of ``AI Behavioral Science,'' covering three perspectives. First, as AI becomes ubiquitous and is increasingly proprietary and opaque, it becomes vital to develop techniques for assessing AI behavior. We outline how tools developed to assess people's behaviors by social scientists can be used to assess and infer AI's behaviors biases, tendencies, and heuristics. Second, we also discuss how AI can change the ways in which we learn about human behavior. Beyond its computational power, AI offers new techniques for simulating, inferring, and predicting human behaviors that we outline and discuss. Third, as humans and AI are interacting in increasingly complex and intertwined systems, we need to understand the implications for the resulting economic and political outcomes. We outline issues that are increasingly pressing concerning the future of human-AI interactions and potential changes and disruptions that can ensue.

AIJul 9, 2023
The Future of Fundamental Science Led by Generative Closed-Loop Artificial Intelligence

Hector Zenil, Jesper Tegnér, Felipe S. Abrahão et al. · cambridge

Recent advances in machine learning and AI, including Generative AI and LLMs, are disrupting technological innovation, product development, and society as a whole. AI's contribution to technology can come from multiple approaches that require access to large training data sets and clear performance evaluation criteria, ranging from pattern recognition and classification to generative models. Yet, AI has contributed less to fundamental science in part because large data sets of high-quality data for scientific practice and model discovery are more difficult to access. Generative AI, in general, and Large Language Models in particular, may represent an opportunity to augment and accelerate the scientific discovery of fundamental deep science with quantitative models. Here we explore and investigate aspects of an AI-driven, automated, closed-loop approach to scientific discovery, including self-driven hypothesis generation and open-ended autonomous exploration of the hypothesis space. Integrating AI-driven automation into the practice of science would mitigate current problems, including the replication of findings, systematic production of data, and ultimately democratisation of the scientific process. Realising these possibilities requires a vision for augmented AI coupled with a diversity of AI approaches able to deal with fundamental aspects of causality analysis and model discovery while enabling unbiased search across the space of putative explanations. These advances hold the promise to unleash AI's potential for searching and discovering the fundamental structure of our world beyond what human scientists have been able to achieve. Such a vision would push the boundaries of new fundamental science rather than automatize current workflows and instead open doors for technological innovation to tackle some of the greatest challenges facing humanity today.

AIJun 2, 2023
Accelerating science with human-aware artificial intelligence

Jamshid Sourati, James Evans

Artificial intelligence (AI) models trained on published scientific findings have been used to invent valuable materials and targeted therapies, but they typically ignore the human scientists who continually alter the landscape of discovery. Here we show that incorporating the distribution of human expertise by training unsupervised models on simulated inferences cognitively accessible to experts dramatically improves (up to 400%) AI prediction of future discoveries beyond those focused on research content alone, especially when relevant literature is sparse. These models succeed by predicting human predictions and the scientists who will make them. By tuning human-aware AI to avoid the crowd, we can generate scientifically promising "alien" hypotheses unlikely to be imagined or pursued without intervention until the distant future, which hold promise to punctuate scientific advance beyond questions currently pursued. Accelerating human discovery or probing its blind spots, human-aware AI enables us to move toward and beyond the contemporary scientific frontier.

AIJul 2, 2022
Complementary artificial intelligence designed to augment human discovery

Jamshid Sourati, James Evans

Neither artificial intelligence designed to play Turing's imitation game, nor augmented intelligence built to maximize the human manipulation of information are tuned to accelerate innovation and improve humanity's collective advance against its greatest challenges. We reconceptualize and pilot beneficial AI to radically augment human understanding by complementing rather than competing with human cognitive capacity. Our approach to complementary intelligence builds on insights underlying the wisdom of crowds, which hinges on the independence and diversity of crowd members' information and approach. By programmatically incorporating information on the evolving distribution of scientific expertise from research papers, our approach follows the distribution of content in the literature while avoiding the scientific crowd and the hypotheses cognitively available to it. We use this approach to generate valuable predictions for what materials possess valuable energy-related properties (e.g., thermoelectricity), and what compounds possess valuable medical properties (e.g., asthma) that complement the human scientific crowd. We demonstrate that our complementary predictions, if identified by human scientists and inventors at all, are only discovered years further into the future. When we evaluate the promise of our predictions with first-principles equations, we demonstrate that increased complementarity of our predictions does not decrease and in some cases increases the probability that the predictions possess the targeted properties. In summary, by tuning AI to avoid the crowd, we can generate hypotheses unlikely to be imagined or pursued until the distant future and promise to punctuate scientific advance. By identifying and correcting for collective human bias, these models also suggest opportunities to improve human prediction by reformulating science education for discovery.

CLMar 30
Theory of Mind and Self-Attributions of Mentality are Dissociable in LLMs

Junsol Kim, Winnie Street, Roberta Rocca et al.

Safety fine-tuning in Large Language Models (LLMs) seeks to suppress potentially harmful forms of mind-attribution such as models asserting their own consciousness or claiming to experience emotions. We investigate whether suppressing mind-attribution tendencies degrades intimately related socio-cognitive abilities such as Theory of Mind (ToM). Through safety ablation and mechanistic analyses of representational similarity, we demonstrate that LLM attributions of mind to themselves and to technological artefacts are behaviorally and mechanistically dissociable from ToM capabilities. Nevertheless, safety fine-tuned models under-attribute mind to non-human animals relative to human baselines and are less likely to exhibit spiritual belief, suppressing widely shared perspectives regarding the distribution and nature of non-human minds.

AIJan 15
Generative AI collective behavior needs an interactionist paradigm

Laura Ferrarotti, Gian Maria Campedelli, Roberto Dessì et al.

In this article, we argue that understanding the collective behavior of agents based on large language models (LLMs) is an essential area of inquiry, with important implications in terms of risks and benefits, impacting us as a society at many levels. We claim that the distinctive nature of LLMs--namely, their initialization with extensive pre-trained knowledge and implicit social priors, together with their capability of adaptation through in-context learning--motivates the need for an interactionist paradigm consisting of alternative theoretical foundations, methodologies, and analytical tools, in order to systematically examine how prior knowledge and embedded values interact with social context to shape emergent phenomena in multi-agent generative AI systems. We propose and discuss four directions that we consider crucial for the development and deployment of LLM-based collectives, focusing on theory, methods, and trans-disciplinary dialogue.

AIFeb 24
PreScience: A Benchmark for Forecasting Scientific Contributions

Anirudh Ajith, Amanpreet Singh, Jay DeYoung et al.

Can AI systems trained on the scientific record up to a fixed point in time forecast the scientific advances that follow? Such a capability could help researchers identify collaborators and impactful research directions, and anticipate which problems and methods will become central next. We introduce PreScience -- a scientific forecasting benchmark that decomposes the research process into four interdependent generative tasks: collaborator prediction, prior work selection, contribution generation, and impact prediction. PreScience is a carefully curated dataset of 98K recent AI-related research papers, featuring disambiguated author identities, temporally aligned scholarly metadata, and a structured graph of companion author publication histories and citations spanning 502K total papers. We develop baselines and evaluations for each task, including LACERScore, a novel LLM-based measure of contribution similarity that outperforms previous metrics and approximates inter-annotator agreement. We find substantial headroom remains in each task -- e.g. in contribution generation, frontier LLMs achieve only moderate similarity to the ground-truth (GPT-5, averages 5.6 on a 1-10 scale). When composed into a 12-month end-to-end simulation of scientific production, the resulting synthetic corpus is systematically less diverse and less novel than human-authored research from the same period.

CYJan 9
Classroom AI: Large Language Models as Grade-Specific Teachers

Jio Oh, Steven Euijong Whang, James Evans et al.

Large Language Models (LLMs) offer a promising solution to complement traditional teaching and address global teacher shortages that affect hundreds of millions of children, but they fail to provide grade-appropriate responses for students at different educational levels. We introduce a framework for finetuning LLMs to generate age-appropriate educational content across six grade levels, from lower elementary to adult education. Our framework successfully adapts explanations to match students' comprehension capacities without sacrificing factual correctness. This approach integrates seven established readability metrics through a clustering method and builds a comprehensive dataset for grade-specific content generation. Evaluations across multiple datasets with 208 human participants demonstrate substantial improvements in grade-level alignment, achieving a 35.64 percentage point increase compared to prompt-based methods while maintaining response accuracy. AI-assisted learning tailored to different grade levels has the potential to advance educational engagement and equity.

CLJan 15
Reasoning Models Generate Societies of Thought

Junsol Kim, Shiyang Lai, Nino Scherrer et al.

Large language models have achieved remarkable capabilities across domains, yet mechanisms underlying sophisticated reasoning remain elusive. Recent reasoning models outperform comparable instruction-tuned models on complex cognitive tasks, attributed to extended computation through longer chains of thought. Here we show that enhanced reasoning emerges not from extended computation alone, but from simulating multi-agent-like interactions -- a society of thought -- which enables diversification and debate among internal cognitive perspectives characterized by distinct personality traits and domain expertise. Through quantitative analysis and mechanistic interpretability methods applied to reasoning traces, we find that reasoning models like DeepSeek-R1 and QwQ-32B exhibit much greater perspective diversity than instruction-tuned models, activating broader conflict between heterogeneous personality- and expertise-related features during reasoning. This multi-agent structure manifests in conversational behaviors, including question-answering, perspective shifts, and the reconciliation of conflicting views, and in socio-emotional roles that characterize sharp back-and-forth conversations, together accounting for the accuracy advantage in reasoning tasks. Controlled reinforcement learning experiments reveal that base models increase conversational behaviors when rewarded solely for reasoning accuracy, and fine-tuning models with conversational scaffolding accelerates reasoning improvement over base models. These findings indicate that the social organization of thought enables effective exploration of solution spaces. We suggest that reasoning models establish a computational parallel to collective intelligence in human groups, where diversity enables superior problem-solving when systematically structured, which suggests new opportunities for agent organization to harness the wisdom of crowds.

MAMay 16
Multi-LLM Systems Exhibit Robust Semantic Collapse

Weiyi Kong, Shiyang Lai, Jinghua Piao et al.

Whether machines can originate novel content has been debated for nearly two centuries, from Lovelace's assertion that no engine can "originate anything" to Turing's question of whether a machine can amplify ideas brought in from outside. Multi-large language model (LLM) systems, increasingly deployed for autonomous generation, reopen this question empirically. Here we show that such systems, operating in closed loops, exhibit semantic collapse: systematic convergence in semantic representations despite apparent lexical variation. Across model families, extended simulations of 200 to 1,000 rounds, the pattern remains consistent. Twelve intervention strategies, spanning decoding parameters, prompt design, agent composition, activation engineering, and reinforcement learning, fail to restore semantic diversity. Mechanistic analyses suggest that semantic collapse is not explained by alignment or conformity biases, but is consistent with intrinsic properties of autoregressive generation. Our results point to fundamental constraints in the ability of multi-LLM systems to sustain open-ended knowledge production in closed-loop settings.

AIMar 18
Signal in the Noise: Polysemantic Interference Transfers and Predicts Cross-Model Influence

Bofan Gong, Shiyang Lai, James Evans et al.

Polysemanticity is pervasive in language models and remains a major challenge for interpretation and model behavioral control. Leveraging sparse autoencoders (SAEs), we map the polysemantic topology of two small models (Pythia-70M and GPT-2-Small) to identify SAE feature pairs that are semantically unrelated yet exhibit interference within models. We intervene at four foci (prompt, token, feature, neuron) and measure induced shifts in the next-token prediction distribution, uncovering polysemantic structures that expose a systematic vulnerability in these models. Critically, interventions distilled from counterintuitive interference patterns shared by two small models transfer reliably to larger instruction-tuned models (Llama-3.1-8B/70B-Instruct and Gemma-2-9B-Instruct), yielding predictable behavioral shifts without access to model internals. These findings challenge the view that polysemanticity is purely stochastic, demonstrating instead that interference structures generalize across scale and family. Such generalization suggests a convergent, higher-order organization of internal representations, which is only weakly aligned with intuition and structured by latent regularities, offering new possibilities for both black-box control and theoretical insight into human and artificial cognition.

AIMar 21
Agentic AI and the next intelligence explosion

James Evans, Benjamin Bratton, Blaise Agüera y Arcas

The "AI singularity" is often miscast as a monolithic, godlike mind. Evolution suggests a different path: intelligence is fundamentally plural, social, and relational. Recent advances in agentic AI reveal that frontier reasoning models, such as DeepSeek-R1, do not improve simply by "thinking longer". Instead, they simulate internal "societies of thought," spontaneous cognitive debates that argue, verify, and reconcile to solve complex tasks. Moreover, we are entering an era of human-AI centaurs: hybrid actors where collective agency transcends individual control. Scaling this intelligence requires shifting from dyadic alignment (RLHF) toward institutional alignment. By designing digital protocols, modeled on organizations and markets, we can build a social infrastructure of checks and balances. The next intelligence explosion will not be a single silicon brain, but a complex, combinatorial society specializing and sprawling like a city. No mind is an island.

CYMar 24
Sibling Rivalry in the Ivory Tower: Mass Science, Expanding Scholarly Families, and the Reshaping of Academic Stratification

Likun Cao, Jie Hua, James Evans

This paper investigates mechanisms underlying scientific stratification in the transition from elite to mass science. Existing scholarship has examined stratification through the Matthew effect framework, but this approach is increasingly limited as mass, team-based research becomes dominant. While scientists now share institutions and lineages, substantial career outcome differences remain unexplained. We propose integrating demographic concepts into science studies. Drawing parallels between biological families and scholarly lineages as fundamental reproductive units, we adapt the birth order concept to examine how doctoral student sequence within a lineage shapes career trajectories. Using data on over one million U.S. doctoral graduates, we find that later students of the same advisor systematically underperform earlier ones across multiple achievement dimensions, both short and long term. Examining underlying mechanisms reveals that although advisors invest comparable resources in all students, later students receive less cognitive stimulation from mature scholars than peers and specialize in narrower niches under peer differentiation pressure. Both of these factors constrain intellectual development and subsequent success. By introducing a demographic framework, this paper offers new perspectives on scientific stratification and demonstrates how demographic concepts can fruitfully analyze broader social and epistemic systems.

AIFeb 7, 2024
Can Large Language Model Agents Simulate Human Trust Behavior?

Chengxing Xie, Canyu Chen, Feiran Jia et al.

Large Language Model (LLM) agents have been increasingly adopted as simulation tools to model humans in social science and role-playing applications. However, one fundamental question remains: can LLM agents really simulate human behavior? In this paper, we focus on one critical and elemental behavior in human interactions, trust, and investigate whether LLM agents can simulate human trust behavior. We first find that LLM agents generally exhibit trust behavior, referred to as agent trust, under the framework of Trust Games, which are widely recognized in behavioral economics. Then, we discover that GPT-4 agents manifest high behavioral alignment with humans in terms of trust behavior, indicating the feasibility of simulating human trust behavior with LLM agents. In addition, we probe the biases of agent trust and differences in agent trust towards other LLM agents and humans. We also explore the intrinsic properties of agent trust under conditions including external manipulations and advanced reasoning strategies. Our study provides new insights into the behaviors of LLM agents and the fundamental analogy between LLMs and humans beyond value alignment. We further illustrate broader implications of our discoveries for applications where trust is paramount.

HCApr 3, 2025
LLM Social Simulations Are a Promising Research Method

Jacy Reese Anthis, Ryan Liu, Sean M. Richardson et al.

Accurate and verifiable large language model (LLM) simulations of human research subjects promise an accessible data source for understanding human behavior and training new AI systems. However, results to date have been limited, and few social scientists have adopted this method. In this position paper, we argue that the promise of LLM social simulations can be achieved by addressing five tractable challenges. We ground our argument in a review of empirical comparisons between LLMs and human research subjects, commentaries on the topic, and related work. We identify promising directions, including context-rich prompting and fine-tuning with social science datasets. We believe that LLM social simulations can already be used for pilot and exploratory studies, and more widespread use may soon be possible with rapidly advancing LLM capabilities. Researchers should prioritize developing conceptual models and iterative evaluations to make the best use of new AI systems.

CLOct 31, 2024
Hidden Persuaders: LLMs' Political Leaning and Their Influence on Voters

Yujin Potter, Shiyang Lai, Junsol Kim et al.

How could LLMs influence our democracy? We investigate LLMs' political leanings and the potential influence of LLMs on voters by conducting multiple experiments in a U.S. presidential election context. Through a voting simulation, we first demonstrate 18 open- and closed-weight LLMs' political preference for a Democratic nominee over a Republican nominee. We show how this leaning towards the Democratic nominee becomes more pronounced in instruction-tuned models compared to their base versions by analyzing their responses to candidate-policy related questions. We further explore the potential impact of LLMs on voter choice by conducting an experiment with 935 U.S. registered voters. During the experiments, participants interacted with LLMs (Claude-3, Llama-3, and GPT-4) over five exchanges. The experiment results show a shift in voter choices towards the Democratic nominee following LLM interaction, widening the voting margin from 0.7% to 4.6%, even though LLMs were not asked to persuade users to support the Democratic nominee during the discourse. This effect is larger than many previous studies on the persuasiveness of political campaigns, which have shown minimal effects in presidential elections. Many users also expressed a desire for further political interaction with LLMs. Which aspects of LLM interactions drove these shifts in voter choice requires further study. Lastly, we explore how a safety method can make LLMs more politically neutral, while raising the question of whether such neutrality is truly the path forward.

SIMay 3
Computational foundations of the human world

Marcus J. Hamilton, Abhishek Yadav, Harrison Hartle et al.

Human societies continuously transform scattered information into collective judgments and coordinated action, whether through markets discovering prices, governments allocating resources, communities enforcing norms, or science converging on reliable claims. Importantly, the computational difficulty of collective decision-making, particularly the time and communication required to reach solutions, imposes fundamental constraints on social organization. While theoretical computer science offers formal tools for analyzing such problems, for instance, by analyzing resource requirements, including time and memory, surprisingly, there is no domain of social science that focuses on the nature of computation in the human world. This perspective argues that we now have the opportunity to deploy these computational frameworks to study human social organization, opening research directions at the intersection of computer science and social science. We highlight core social phenomena that can be framed as computational, including (i) distributed consensus and coordinated action, (ii) societal restructuring with scale, (iii) hierarchical and modular structure, and (iv) externalized memory systems. We identify several concepts from theoretical computer science that may provide insight into these phenomena, especially emphasizing more recently developed approaches beyond the paradigm of Turing~Machines and worst-case computational complexity.

SOC-PHDec 30, 2025
Deep versus Broad Technology Search and the Timing of Innovation Impact

Likun Cao, James Evans

This study offers a new perspective on the depth-versus-breadth debate in innovation strategy, by modeling inventive search within dynamic collective knowledge systems, and underscoring the importance of timing for technological impact. Using frontier machine learning to project patent citation networks in hyperbolic space, we analyze 4.9 million U.S. patents to examine how search strategies give rise to distinct temporal patterns in impact accumulation. We find that inventions based on deep search, which relies on a specialized understanding of complex recombination structures, drive higher short-term impact through early adoption within specialized communities, but face diminishing returns as innovations become "locked-in" with limited diffusion potential. Conversely, when inventions are grounded in broad search that spans disparate domains, they encounter initial resistance but achieve wider diffusion and greater long-term impact by reaching cognitively diverse audiences. Individual inventions require both depth and breadth for stable impact. Organizations can strategically balance approaches across multiple inventions: using depth to build reliable technological infrastructure while pursuing breadth to expand applications. We advance innovation theory by demonstrating how deep and broad search strategies distinctly shape the timing and trajectory of technological impact, and how individual inventors and organizations can leverage these mechanisms to balance exploitation and exploration.

SIApr 14
Generalization and the Rise of System-level Creativity in Science

Hongbo Fang, James Evans

Scientific progress has long been understood as recombinant, with breakthroughs arising when existing ideas are joined in new ways. Empirical work in this tradition has focused on the inputs to discovery, asking whether a paper draws together atypical or distant prior knowledge. Far less is known about how knowledge is supplied for downstream recombination, or how individual contributions are forged to play distinct and distant roles in the broader system of science. Using citation networks from tens of millions of publications in OpenAlex and the Web of Science, here we show that scientific contributions stably decompose into three functional types, foundations, extensions, and generalizations, distinguishable by the local structure of their forward citations. This decomposition of the 'functional role' of scientific work presents an unseen pattern of scientific production: foundational and extensional work, which respectively build and elaborate within disciplines, dominated the post-war decades but has declined steadily since the early 1990s, while generalizations, meaning compressed and modular contributions reused in distant fields, have risen sharply. Stacked difference-in-differences analyses that exploit venues' transitions to online access and authors' adoption of large language models provide causal evidence that digital knowledge infrastructure is driving this shift. The locus of innovation has thus migrated from within what Simon might characterize as nearly decomposable disciplinary modules to the interfaces between them, recasting the much-discussed decline of disruption as a structural reorganization of science rather than a slowdown, and revealing a growing misalignment between how science now advances and how it is recognized and rewarded.

CLMar 3, 2025
Linear Representations of Political Perspective Emerge in Large Language Models

Junsol Kim, James Evans, Aaron Schein

Large language models (LLMs) have demonstrated the ability to generate text that realistically reflects a range of different subjective human perspectives. This paper studies how LLMs are seemingly able to reflect more liberal versus more conservative viewpoints among other political perspectives in American politics. We show that LLMs possess linear representations of political perspectives within activation space, wherein more similar perspectives are represented closer together. To do so, we probe the attention heads across the layers of three open transformer-based LLMs (Llama-2-7b-chat, Mistral-7b-instruct, Vicuna-7b). We first prompt models to generate text from the perspectives of different U.S. lawmakers. We then identify sets of attention heads whose activations linearly predict those lawmakers' DW-NOMINATE scores, a widely-used and validated measure of political ideology. We find that highly predictive heads are primarily located in the middle layers, often speculated to encode high-level concepts and tasks. Using probes only trained to predict lawmakers' ideology, we then show that the same probes can predict measures of news outlets' slant from the activations of models prompted to simulate text from those news outlets. These linear probes allow us to visualize, interpret, and monitor ideological stances implicitly adopted by an LLM as it generates open-ended responses. Finally, we demonstrate that by applying linear interventions to these attention heads, we can steer the model outputs toward a more liberal or conservative stance. Overall, our research suggests that LLMs possess a high-level linear representation of American political ideology and that by leveraging recent advances in mechanistic interpretability, we can identify, monitor, and steer the subjective perspective underlying generated text.

CYSep 12, 2025
We Need a New Ethics for a World of AI Agents

Iason Gabriel, Geoff Keeling, Arianna Manzini et al.

The deployment of capable AI agents raises fresh questions about safety, human-machine relationships and social coordination. We argue for greater engagement by scientists, scholars, engineers and policymakers with the implications of a world increasingly populated by AI agents. We explore key challenges that must be addressed to ensure that interactions between humans and agents, and among agents themselves, remain broadly beneficial.

AIOct 25, 2024
Designing AI-Agents with Personalities: A Psychometric Approach

Muhua Huang, Xijuan Zhang, Christopher Soto et al.

We introduce a methodology for assigning quantifiable and psychometrically validated personalities to AI-Agents using the Big Five framework. Across three studies, we evaluate its feasibility and limitations. In Study 1, we show that large language models (LLMs) capture semantic similarities among Big Five measures, providing a basis for personality assignment. In Study 2, we create AI-Agents using prompts designed based on the Big Five Inventory-2 (BFI-2) in different format, and find that AI-Agents powered by new models align more closely with human responses on the Mini-Markers test, although the finer pattern of results (e.g., factor loading patterns) were sometimes inconsistent. In Study 3, we validate our AI-Agents on risk-taking and moral dilemma vignettes, finding that models prompted with the BFI-2-Expanded format most closely reproduce human personality-decision associations, while safety-aligned models generally inflate 'moral' ratings. Overall, our results show that AI-Agents align with humans in correlations between input Big Five traits and output responses and may serve as useful tools for preliminary research. Nevertheless, discrepancies in finer response patterns indicate that AI-Agents cannot (yet) fully substitute for human participants in precision or high-stakes projects.

CLFeb 19, 2024
Evolving AI Collectives to Enhance Human Diversity and Enable Self-Regulation

Shiyang Lai, Yujin Potter, Junsol Kim et al.

Large language model behavior is shaped by the language of those with whom they interact. This capacity and their increasing prevalence online portend that they will intentionally or unintentionally "program" one another and form emergent AI subjectivities, relationships, and collectives. Here, we call upon the research community to investigate these "societies" of interacting artificial intelligences to increase their rewards and reduce their risks for human society and the health of online environments. We use a small "community" of models and their evolving outputs to illustrate how such emergent, decentralized AI collectives can spontaneously expand the bounds of human diversity and reduce the risk of toxic, anti-social behavior online. Finally, we discuss opportunities for AI cross-moderation and address ethical issues and design challenges associated with creating and maintaining free-formed AI collectives.

AIMay 22, 2025
Advancing the Scientific Method with Large Language Models: From Hypothesis to Discovery

Yanbo Zhang, Sumeer A. Khan, Adnan Mahmud et al.

With recent Nobel Prizes recognising AI contributions to science, Large Language Models (LLMs) are transforming scientific research by enhancing productivity and reshaping the scientific method. LLMs are now involved in experimental design, data analysis, and workflows, particularly in chemistry and biology. However, challenges such as hallucinations and reliability persist. In this contribution, we review how Large Language Models (LLMs) are redefining the scientific method and explore their potential applications across different stages of the scientific cycle, from hypothesis testing to discovery. We conclude that, for LLMs to serve as relevant and effective creative engines and productivity enhancers, their deep integration into all steps of the scientific process should be pursued in collaboration and alignment with human scientific goals, with clear evaluation metrics. The transition to AI-driven science raises ethical questions about creativity, oversight, and responsibility. With careful guidance, LLMs could evolve into creative engines, driving transformative breakthroughs across scientific disciplines responsibly and effectively. However, the scientific community must also decide how much it leaves to LLMs to drive science, even when associations with 'reasoning', mostly currently undeserved, are made in exchange for the potential to explore hypothesis and solution regions that might otherwise remain unexplored by human exploration alone.

CLApr 14, 2025
Localized Cultural Knowledge is Conserved and Controllable in Large Language Models

Veniamin Veselovsky, Berke Argin, Benedikt Stroebl et al.

Just as humans display language patterns influenced by their native tongue when speaking new languages, LLMs often default to English-centric responses even when generating in other languages. Nevertheless, we observe that local cultural information persists within the models and can be readily activated for cultural customization. We first demonstrate that explicitly providing cultural context in prompts significantly improves the models' ability to generate culturally localized responses. We term the disparity in model performance with versus without explicit cultural context the explicit-implicit localization gap, indicating that while cultural knowledge exists within LLMs, it may not naturally surface in multilingual interactions if cultural context is not explicitly provided. Despite the explicit prompting benefit, however, the answers reduce in diversity and tend toward stereotypes. Second, we identify an explicit cultural customization vector, conserved across all non-English languages we explore, which enables LLMs to be steered from the synthetic English cultural world-model toward each non-English cultural world. Steered responses retain the diversity of implicit prompting and reduce stereotypes to dramatically improve the potential for customization. We discuss the implications of explicit cultural customization for understanding the conservation of alternative cultural world models within LLMs, and their controllable utility for translation, cultural customization, and the possibility of making the explicit implicit through soft control for expanded LLM function and appeal.

HCAug 12, 2025
Biased AI improves human decision-making but reduces trust

Shiyang Lai, Junsol Kim, Nadav Kunievsky et al.

Current AI systems minimize risk by enforcing ideological neutrality, yet this may introduce automation bias by suppressing cognitive engagement in human decision-making. We conducted randomized trials with 2,500 participants to test whether culturally biased AI enhances human decision-making. Participants interacted with politically diverse GPT-4o variants on information evaluation tasks. Partisan AI assistants enhanced human performance, increased engagement, and reduced evaluative bias compared to non-biased counterparts, with amplified benefits when participants encountered opposing views. These gains carried a trust penalty: participants underappreciated biased AI and overcredited neutral systems. Exposing participants to two AIs whose biases flanked human perspectives closed the perception-performance gap. These findings complicate conventional wisdom about AI neutrality, suggesting that strategic integration of diverse cultural biases may foster improved and resilient human decision-making.

MAOct 25, 2025
Hollywood Town: Long-Video Generation via Cross-Modal Multi-Agent Orchestration

Zheng Wei, Mingchen Li, Zeqian Zhang et al.

Recent advancements in multi-agent systems have demonstrated significant potential for enhancing creative task performance, such as long video generation. This study introduces three innovations to improve multi-agent collaboration. First, we propose OmniAgent, a hierarchical, graph-based multi-agent framework for long video generation that leverages a film-production-inspired architecture to enable modular specialization and scalable inter-agent collaboration. Second, inspired by context engineering, we propose hypergraph nodes that enable temporary group discussions among agents lacking sufficient context, reducing individual memory requirements while ensuring adequate contextual information. Third, we transition from directed acyclic graphs (DAGs) to directed cyclic graphs with limited retries, allowing agents to reflect and refine outputs iteratively, thereby improving earlier stages through feedback from subsequent nodes. These contributions lay the groundwork for developing more robust multi-agent systems in creative tasks.

AISep 18, 2025
Rationality Check! Benchmarking the Rationality of Large Language Models

Zhilun Zhou, Jing Yi Wang, Nicholas Sukiennik et al.

Large language models (LLMs), a recent advance in deep learning and machine intelligence, have manifested astonishing capacities, now considered among the most promising for artificial general intelligence. With human-like capabilities, LLMs have been used to simulate humans and serve as AI assistants across many applications. As a result, great concern has arisen about whether and under what circumstances LLMs think and behave like real human agents. Rationality is among the most important concepts in assessing human behavior, both in thinking (i.e., theoretical rationality) and in taking action (i.e., practical rationality). In this work, we propose the first benchmark for evaluating the omnibus rationality of LLMs, covering a wide range of domains and LLMs. The benchmark includes an easy-to-use toolkit, extensive experimental results, and analysis that illuminates where LLMs converge and diverge from idealized human rationality. We believe the benchmark can serve as a foundational tool for both developers and users of LLMs.

CYJun 30, 2025
Epitome: Pioneering an Experimental Platform for AI-Social Science Integration

Jingjing Qu, Kejia Hu, Jun Zhu et al.

The integration of Large Language Models (LLMs) into social science experiments represents a transformative approach to understanding human-AI interactions and their societal impacts. We introduce Epitome, the world's first open experimental platform dedicated to the deep integration of artificial intelligence and social science. Rooted in theoretical foundations from management, communication studies, sociology, psychology, and ethics, Epitome focuses on the interactive impacts of AI on individuals, organizations, and society during its real-world deployment. It constructs a theoretical support system through cross-disciplinary experiments. The platform offers a one-stop comprehensive experimental solution spanning "foundation models-complex application development-user feedback" through seven core modules, while embedding the classical "control-comparison-comparative causal logic" of social science experiments into multilevel human-computer interaction environments, including dialogues, group chats, and multi-agent virtual scenarios. With its canvas-style, user-friendly interface, Epitome enables researchers to easily design and run complex experimental scenarios, facilitating systematic investigations into the social impacts of AI and exploration of integrated solutions.To demonstrate its capabilities, we replicated three seminal social science experiments involving LLMs, showcasing Epitome's potential to streamline complex experimental designs and produce robust results, suitable for publishing in the top selective journals. Our findings highlight the platform's utility in enhancing the efficiency and quality of human-AI interactions, providing valuable insights into the societal implications of AI technologies. Epitome thus offers a powerful tool for advancing interdisciplinary research at the intersection of AI and social science, with potential applications in policy-making, ...

CLJun 5, 2025
Subjective Perspectives within Learned Representations Predict High-Impact Innovation

Likun Cao, Rui Pan, James Evans

Existing studies of innovation emphasize the power of social structures to shape innovation capacity. Emerging machine learning approaches, however, enable us to model innovators' personal perspectives and interpersonal innovation opportunities as a function of their prior experience. We theorize and then quantify subjective perspectives and their interaction based on innovator positions within the geometric space of concepts inscribed by dynamic machine-learned language representations. Using data on millions of scientists, inventors, screenplay writers, entrepreneurs, and Wikipedia contributors across their respective creative domains, here we show that measured subjective perspectives predict which ideas individuals and groups will creatively attend to and successfully combine in the future. Across all cases and time periods we examine, when perspective diversity is decomposed as the difference between collaborators' perspectives on their creation, and background diversity as the difference between their experiences, the former consistently anticipates creative achievement while the latter portends its opposite. We analyze a natural experiment and simulate creative collaborations between AI agents designed with various perspective and background diversity, which support our observational findings. We explore mechanisms underlying these findings and identify how successful collaborators leverage common language to weave together diverse experiences obtained through trajectories of prior work. These perspectives converge and provoke one another to innovate. We examine the significance of these findings for team formation and research policy.

MAMay 28, 2025
Topological Structure Learning Should Be A Research Priority for LLM-Based Multi-Agent Systems

Jiaxi Yang, Mengqi Zhang, Yiqiao Jin et al. · amazon-science

Large Language Model-based Multi-Agent Systems (MASs) have emerged as a powerful paradigm for tackling complex tasks through collaborative intelligence. However, the topology of these systems--how agents in MASs should be configured, connected, and coordinated--remains largely unexplored. In this position paper, we call for a paradigm shift toward \emph{topology-aware MASs} that explicitly model and dynamically optimize the structure of inter-agent interactions. We identify three fundamental components--agents, communication links, and overall topology--that collectively determine the system's adaptability, efficiency, robustness, and fairness. To operationalize this vision, we introduce a systematic three-stage framework: 1) agent selection, 2) structure profiling, and 3) topology synthesis. This framework not only provides a principled foundation for designing MASs but also opens new research frontiers across language modeling, reinforcement learning, graph learning, and generative modeling to ultimately unleash their full potential in complex real-world applications. We conclude by outlining key challenges and opportunities in MASs evaluation. We hope our framework and perspectives offer critical new insights in the era of agentic AI.

GNMay 13, 2025
Big Data and the Computational Social Science of Entrepreneurship and Innovation

Ningzi Li, Shiyang Lai, James Evans

As large-scale social data explode and machine-learning methods evolve, scholars of entrepreneurship and innovation face new research opportunities but also unique challenges. This chapter discusses the difficulties of leveraging large-scale data to identify technological and commercial novelty, document new venture origins, and forecast competition between new technologies and commercial forms. It suggests how scholars can take advantage of new text, network, image, audio, and video data in two distinct ways that advance innovation and entrepreneurship research. First, machine-learning models, combined with large-scale data, enable the construction of precision measurements that function as system-level observatories of innovation and entrepreneurship across human societies. Second, new artificial intelligence models fueled by big data generate 'digital doubles' of technology and business, forming laboratories for virtual experimentation about innovation and entrepreneurship processes and policies. The chapter argues for the advancement of theory development and testing in entrepreneurship and innovation by coupling big data with big models.

AIMar 8, 2025
Research on Superalignment Should Advance Now with Parallel Optimization of Competence and Conformity

HyunJin Kim, Xiaoyuan Yi, Jing Yao et al.

The recent leap in AI capabilities, driven by big generative models, has sparked the possibility of achieving Artificial General Intelligence (AGI) and further triggered discussions on Artificial Superintelligence (ASI), a system surpassing all humans across all domains. This gives rise to the critical research question of: If we realize ASI, how do we align it with human values, ensuring it benefits rather than harms human society, a.k.a., the Superalignment problem. Despite ASI being regarded by many as solely a hypothetical concept, in this paper, we argue that superalignment is achievable and research on it should advance immediately, through simultaneous and alternating optimization of task competence and value conformity. We posit that superalignment is not merely a safeguard for ASI but also necessary for its realization. To support this position, we first provide a formal definition of superalignment rooted in the gap between capability and capacity and elaborate on our argument. Then we review existing paradigms, explore their interconnections and limitations, and illustrate a potential path to superalignment centered on two fundamental principles. We hope this work sheds light on a practical approach for developing the value-aligned next-generation AI, garnering greater benefits and reducing potential harms for humanity.

NEJun 27, 2024
Computational Life: How Well-formed, Self-replicating Programs Emerge from Simple Interaction

Blaise Agüera y Arcas, Jyrki Alakuijala, James Evans et al.

The fields of Origin of Life and Artificial Life both question what life is and how it emerges from a distinct set of "pre-life" dynamics. One common feature of most substrates where life emerges is a marked shift in dynamics when self-replication appears. While there are some hypotheses regarding how self-replicators arose in nature, we know very little about the general dynamics, computational principles, and necessary conditions for self-replicators to emerge. This is especially true on "computational substrates" where interactions involve logical, mathematical, or programming rules. In this paper we take a step towards understanding how self-replicators arise by studying several computational substrates based on various simple programming languages and machine instruction sets. We show that when random, non self-replicating programs are placed in an environment lacking any explicit fitness landscape, self-replicators tend to arise. We demonstrate how this occurs due to random interactions and self-modification, and can happen with and without background random mutations. We also show how increasingly complex dynamics continue to emerge following the rise of self-replicators. Finally, we show a counterexample of a minimalistic programming language where self-replicators are possible, but so far have not been observed to arise.

APDec 11, 2021
The Past as a Stochastic Process

David H. Wolpert, Michael H. Price, Stefani A. Crabtree et al.

Historical processes manifest remarkable diversity. Nevertheless, scholars have long attempted to identify patterns and categorize historical actors and influences with some success. A stochastic process framework provides a structured approach for the analysis of large historical datasets that allows for detection of sometimes surprising patterns, identification of relevant causal actors both endogenous and exogenous to the process, and comparison between different historical cases. The combination of data, analytical tools and the organizing theoretical framework of stochastic processes complements traditional narrative approaches in history and archaeology.

CLApr 13, 2021
Semantic maps and metrics for science Semantic maps and metrics for science using deep transformer encoders

Brendan Chambers, James Evans

The growing deluge of scientific publications demands text analysis tools that can help scientists and policy-makers navigate, forecast and beneficially guide scientific research. Recent advances in natural language understanding driven by deep transformer networks offer new possibilities for mapping science. Because the same surface text can take on multiple and sometimes contradictory specialized senses across distinct research communities, sensitivity to context is critical for infometric applications. Transformer embedding models such as BERT capture shades of association and connotation that vary across the different linguistic contexts of any particular word or span of text. Here we report a procedure for encoding scientific documents with these tools, measuring their improvement over static word embeddings in a nearest-neighbor retrieval task. We find discriminability of contextual representations is strongly influenced by choice of pooling strategy for summarizing the high-dimensional network activations. Importantly, we note that fundamentals such as domain-matched training data are more important than state-of-the-art NLP tools. Yet state-of-the-art models did offer significant gains. The best approach we investigated combined domain-matched pretraining, sound pooling, and state-of-the-art deep transformer network encoders. Finally, with the goal of leveraging contextual representations from deep encoders, we present a range of measurements for understanding and forecasting research communities in science.

AIApr 12, 2021
Accelerating science with human versus alien artificial intelligences

Jamshid Sourati, James Evans

Data-driven artificial intelligence models fed with published scientific findings have been used to create powerful prediction engines for scientific and technological advance, such as the discovery of novel materials with desired properties and the targeted invention of new therapies and vaccines. These AI approaches typically ignore the distribution of human prediction engines -- scientists and inventor -- who continuously alter the landscape of discovery and invention. As a result, AI hypotheses are designed to substitute for human experts, failing to complement them for punctuated collective advance. Here we show that incorporating the distribution of human expertise into self-supervised models by training on inferences cognitively available to experts dramatically improves AI prediction of future human discoveries and inventions. Including expert-awareness into models that propose (a) valuable energy-relevant materials increases the precision of materials predictions by ~100%, (b) repurposing thousands of drugs to treat new diseases increases precision by 43%, and (c) COVID-19 vaccine candidates examined in clinical trials by 260%. These models succeed by predicting human predictions and the scientists who will make them. By tuning AI to avoid the crowd, however, it generates scientifically promising "alien" hypotheses unlikely to be imagined or pursued without intervention, not only accelerating but punctuating scientific advance. By identifying and correcting for collective human bias, these models also suggest opportunities to improve human prediction by reformulating science education for discovery.

APNov 4, 2019
Long-range Event-level Prediction and Response Simulation for Urban Crime and Global Terrorism with Granger Networks

Timmy Li, Yi Huang, James Evans et al.

Large-scale trends in urban crime and global terrorism are well-predicted by socio-economic drivers, but focused, event-level predictions have had limited success. Standard machine learning approaches are promising, but lack interpretability, are generally interpolative, and ineffective for precise future interventions with costly and wasteful false positives. Here, we are introducing Granger Network inference as a new forecasting approach for individual infractions with demonstrated performance far surpassing past results, yet transparent enough to validate and extend social theory. Considering the problem of predicting crime in the City of Chicago, we achieve an average AUC of ~90\% for events predicted a week in advance within spatial tiles approximately $1000$ ft across. Instead of pre-supposing that crimes unfold across contiguous spaces akin to diffusive systems, we learn the local transport rules from data. As our key insights, we uncover indications of suburban bias -- how law-enforcement response is modulated by socio-economic contexts with disproportionately negative impacts in the inner city -- and how the dynamics of violent and property crimes co-evolve and constrain each other -- lending quantitative support to controversial pro-active policing policies. To demonstrate broad applicability to spatio-temporal phenomena, we analyze terror attacks in the middle-east in the recent past, and achieve an AUC of ~80% for predictions made a week in advance, and within spatial tiles measuring approximately 120 miles across. We conclude that while crime operates near an equilibrium quickly dissipating perturbations, terrorism does not. Indeed terrorism aims to destabilize social order, as shown by its dynamics being susceptible to run-away increases in event rates under small perturbations.

DLOct 18, 2019
Science and Technology Advance through Surprise

Feng Shi, James Evans

Breakthrough discoveries and inventions involve unexpected combinations of contents including problems, methods, and natural entities, and also diverse contexts such as journals, subfields, and conferences. Drawing on data from tens of millions of research papers, patents, and researchers, we construct models that predict next year's content and context combinations with an AUC of 95% based on embeddings constructed from high-dimensional stochastic block models, where the improbability of new combinations itself predicts up to 50% of the likelihood that they will gain outsized citations and major awards. Most of these breakthroughs occur when problems in one field are unexpectedly solved by researchers from a distant other. These findings demonstrate the critical role of surprise in advance, and enable evaluation of scientific institutions ranging from education and peer review to awards in supporting it.

MLSep 6, 2018
Wasserstein Soft Label Propagation on Hypergraphs: Algorithm and Generalization Error Bounds

Tingran Gao, Shahab Asoodeh, Yi Huang et al.

Inspired by recent interests of developing machine learning and data mining algorithms on hypergraphs, we investigate in this paper the semi-supervised learning algorithm of propagating "soft labels" (e.g. probability distributions, class membership scores) over hypergraphs, by means of optimal transportation. Borrowing insights from Wasserstein propagation on graphs [Solomon et al. 2014], we re-formulate the label propagation procedure as a message-passing algorithm, which renders itself naturally to a generalization applicable to hypergraphs through Wasserstein barycenters. Furthermore, in a PAC learning framework, we provide generalization error bounds for propagating one-dimensional distributions on graphs and hypergraphs using 2-Wasserstein distance, by establishing the \textit{algorithmic stability} of the proposed semi-supervised learning algorithm. These theoretical results also shed new lights upon deeper understandings of the Wasserstein propagation on graphs.

ITMar 22, 2018
Curvature of Hypergraphs via Multi-Marginal Optimal Transport

Shahab Asoodeh, Tingran Gao, James Evans

We introduce a novel definition of curvature for hypergraphs, a natural generalization of graphs, by introducing a multi-marginal optimal transport problem for a naturally defined random walk on the hypergraph. This curvature, termed \emph{coarse scalar curvature}, generalizes a recent definition of Ricci curvature for Markov chains on metric spaces by Ollivier [Journal of Functional Analysis 256 (2009) 810-864], and is related to the scalar curvature when the hypergraph arises naturally from a Riemannian manifold. We investigate basic properties of the coarse scalar curvature and obtain several bounds. Empirical experiments indicate that coarse scalar curvatures are capable of detecting "bridges" across connected components in hypergraphs, suggesting it is an appropriate generalization of curvature on simple graphs.

LGFeb 20, 2018
Adaptive Sampling for Coarse Ranking

Sumeet Katariya, Lalit Jain, Nandana Sengupta et al.

We consider the problem of active coarse ranking, where the goal is to sort items according to their means into clusters of pre-specified sizes, by adaptively sampling from their reward distributions. This setting is useful in many social science applications involving human raters and the approximate rank of every item is desired. Approximate or coarse ranking can significantly reduce the number of ratings required in comparison to the number needed to find an exact ranking. We propose a computationally efficient PAC algorithm LUCBRank for coarse ranking, and derive an upper bound on its sample complexity. We also derive a nearly matching distribution-dependent lower bound. Experiments on synthetic as well as real-world data show that LUCBRank performs better than state-of-the-art baseline methods, even when these methods have the advantage of knowing the underlying parametric model.

SINov 29, 2017
The Wisdom of Polarized Crowds

Feng Shi, Misha Teplitskiy, Eamon Duede et al.

As political polarization in the United States continues to rise, the question of whether polarized individuals can fruitfully cooperate becomes pressing. Although diversity of individual perspectives typically leads to superior team performance on complex tasks, strong political perspectives have been associated with conflict, misinformation and a reluctance to engage with people and perspectives beyond one's echo chamber. It is unclear whether self-selected teams of politically diverse individuals will create higher or lower quality outcomes. In this paper, we explore the effect of team political composition on performance through analysis of millions of edits to Wikipedia's Political, Social Issues, and Science articles. We measure editors' political alignments by their contributions to conservative versus liberal articles. A survey of editors validates that those who primarily edit liberal articles identify more strongly with the Democratic party and those who edit conservative ones with the Republican party. Our analysis then reveals that polarized teams---those consisting of a balanced set of politically diverse editors---create articles of higher quality than politically homogeneous teams. The effect appears most strongly in Wikipedia's Political articles, but is also observed in Social Issues and even Science articles. Analysis of article "talk pages" reveals that politically polarized teams engage in longer, more constructive, competitive, and substantively focused but linguistically diverse debates than political moderates. More intense use of Wikipedia policies by politically diverse teams suggests institutional design principles to help unleash the power of politically polarized teams.

CLAug 19, 2015
Fast, Flexible Models for Discovering Topic Correlation across Weakly-Related Collections

Jingwei Zhang, Aaron Gerow, Jaan Altosaar et al.

Weak topic correlation across document collections with different numbers of topics in individual collections presents challenges for existing cross-collection topic models. This paper introduces two probabilistic topic models, Correlated LDA (C-LDA) and Correlated HDP (C-HDP). These address problems that can arise when analyzing large, asymmetric, and potentially weakly-related collections. Topic correlations in weakly-related collections typically lie in the tail of the topic distribution, where they would be overlooked by models unable to fit large numbers of topics. To efficiently model this long tail for large-scale analysis, our models implement a parallel sampling algorithm based on the Metropolis-Hastings and alias methods (Yuan et al., 2015). The models are first evaluated on synthetic data, generated to simulate various collection-level asymmetries. We then present a case study of modeling over 300k documents in collections of sciences and humanities research from JSTOR.