91.3CYMar 11
RCTs & Human Uplift Studies: Methodological Challenges and Practical Solutions for Frontier AI EvaluationPatricia Paskov, Kevin Wei, Shen Zhou Hong et al. · cmu
Human uplift studies - or studies that measure AI effects on human performance relative to a status quo, typically using randomized controlled trial (RCT) methodology - are increasingly used to inform deployment, governance, and safety decisions for frontier AI systems. While the methods underlying these studies are well-established, their interaction with the distinctive properties of frontier AI systems remains underexamined, particularly when results are used to inform high-stakes decisions. We present findings from interviews with 16 expert practitioners with experience conducting human uplift studies in domains including biosecurity, cybersecurity, education, and labor. Across interviews, experts described a recurring tension between standard causal inference assumptions and the object of study itself. Rapidly evolving AI systems, shifting baselines, heterogeneous and changing user proficiency, and porous real-world settings strain assumptions underlying internal, external, and construct validity, complicating the interpretation and appropriate use of uplift evidence. We synthesize these challenges across key stages of the human uplift research lifecycle and map them to practitioner-reported solutions, clarifying both the limits and the appropriate uses of evidence from human uplift studies in high-stakes decision-making.
CYJan 23, 2024
Visibility into AI AgentsAlan Chan, Carson Ezell, Max Kaufmann et al. · cambridge
Increased delegation of commercial, scientific, governmental, and personal activities to AI agents -- systems capable of pursuing complex goals with limited supervision -- may exacerbate existing societal risks and introduce new risks. Understanding and mitigating these risks involves critically evaluating existing governance structures, revising and adapting these structures where needed, and ensuring accountability of key stakeholders. Information about where, why, how, and by whom certain AI agents are used, which we refer to as visibility, is critical to these objectives. In this paper, we assess three categories of measures to increase visibility into AI agents: agent identifiers, real-time monitoring, and activity logging. For each, we outline potential implementations that vary in intrusiveness and informativeness. We analyze how the measures apply across a spectrum of centralized through decentralized deployment contexts, accounting for various actors in the supply chain including hardware and software service providers. Finally, we discuss the implications of our measures for privacy and concentration of power. Further work into understanding the measures and mitigating their negative impacts can help to build a foundation for the governance of AI agents.
MAFeb 19, 2025
Multi-Agent Risks from Advanced AILewis Hammond, Alan Chan, Jesse Clifton et al. · stanford
The rapid development of advanced AI agents and the imminent deployment of many instances of these agents will give rise to multi-agent systems of unprecedented complexity. These systems pose novel and under-explored risks. In this report, we provide a structured taxonomy of these risks by identifying three key failure modes (miscoordination, conflict, and collusion) based on agents' incentives, as well as seven key risk factors (information asymmetries, network effects, selection pressures, destabilising dynamics, commitment problems, emergent agency, and multi-agent security) that can underpin them. We highlight several important instances of each risk, as well as promising directions to help mitigate them. By anchoring our analysis in a range of real-world examples and experimental evidence, we illustrate the distinct challenges posed by multi-agent systems and their implications for the safety, governance, and ethics of advanced AI.
SEFeb 3, 2025
The AI Agent IndexStephen Casper, Luke Bailey, Rosco Hunter et al.
Leading AI developers and startups are increasingly deploying agentic AI systems that can plan and execute complex tasks with limited human involvement. However, there is currently no structured framework for documenting the technical components, intended uses, and safety features of agentic systems. To fill this gap, we introduce the AI Agent Index, the first public database to document information about currently deployed agentic AI systems. For each system that meets the criteria for inclusion in the index, we document the system's components (e.g., base model, reasoning implementation, tool use), application domains (e.g., computer use, software engineering), and risk management practices (e.g., evaluation results, guardrails), based on publicly available information and correspondence with developers. We find that while developers generally provide ample information regarding the capabilities and applications of agentic systems, they currently provide limited information regarding safety and risk management practices. The AI Agent Index is available online at https://aiagentindex.mit.edu/
CYAug 19, 2025
Incident Analysis for AI AgentsCarson Ezell, Xavier Roberts-Gaal, Alan Chan
As AI agents become more widely deployed, we are likely to see an increasing number of incidents: events involving AI agent use that directly or indirectly cause harm. For example, agents could be prompt-injected to exfiltrate private information or make unauthorized purchases. Structured information about such incidents (e.g., user prompts) can help us understand their causes and prevent future occurrences. However, existing incident reporting processes are not sufficient for understanding agent incidents. In particular, such processes are largely based on publicly available data, which excludes useful, but potentially sensitive, information such as an agent's chain of thought or browser history. To inform the development of new, emerging incident reporting processes, we propose an incident analysis framework for agents. Drawing on systems safety approaches, our framework proposes three types of factors that can cause incidents: system-related (e.g., CBRN training data), contextual (e.g., prompt injections), and cognitive (e.g., misunderstanding a user request). We also identify specific information that could help clarify which factors are relevant to a given incident: activity logs, system documentation and access, and information about the tools an agent uses. We provide recommendations for 1) what information incident reports should include and 2) what information developers and deployers should retain and make available to incident investigators upon request. As we transition to a world with more agents, understanding agent incidents will become increasingly crucial for managing risks.
CYNov 26, 2025
Dark Speculation: Combining Qualitative and Quantitative Understanding in Frontier AI Risk AnalysisDaniel Carpenter, Carson Ezell, Pratyush Mallick et al.
Estimating catastrophic harms from frontier AI is hindered by deep ambiguity: many of its risks are not only unobserved but unanticipated by analysts. The central limitation of current risk analysis is the inability to populate the $\textit{catastrophic event space}$, or the set of potential large-scale harms to which probabilities might be assigned. This intractability is worsened by the $\textit{Lucretius problem}$, or the tendency to infer future risks only from past experience. We propose a process of $\textit{dark speculation}$, in which systematically generating and refining catastrophic scenarios ("qualitative" work) is coupled with estimating their likelihoods and associated damages (quantitative underwriting analysis). The idea is neither to predict the future nor to enable insurance for its own sake, but to use narrative and underwriting tools together to generate probability distributions over outcomes. We formalize this process using a simplified catastrophic Lévy stochastic framework and propose an iterative institutional design in which (1) speculation (including scenario planning) generates detailed catastrophic event narratives, (2) insurance underwriters assign probabilistic and financial parameters to these narratives, and (3) decision-makers synthesize the results into summary statistics to inform judgment. Analysis of the model reveals the value of (a) maintaining independence between speculation and underwriting, (b) analyzing multiple risk categories in parallel, and (c) generating "thick" catastrophic narrative rich in causal (counterfactual) and mitigative detail. While the approach cannot eliminate deep ambiguity, it offers a systematic approach to reason about extreme, low-probability events in frontier AI, tempering complacency and overreaction. The framework is adaptable for iterative use and can be further augmented with AI systems.
CYJan 25, 2024
Black-Box Access is Insufficient for Rigorous AI AuditsStephen Casper, Carson Ezell, Charlotte Siegmann et al.
External audits of AI systems are increasingly recognized as a key mechanism for AI governance. The effectiveness of an audit, however, depends on the degree of access granted to auditors. Recent audits of state-of-the-art AI systems have primarily relied on black-box access, in which auditors can only query the system and observe its outputs. However, white-box access to the system's inner workings (e.g., weights, activations, gradients) allows an auditor to perform stronger attacks, more thoroughly interpret models, and conduct fine-tuning. Meanwhile, outside-the-box access to training and deployment information (e.g., methodology, code, documentation, data, deployment details, findings from internal evaluations) allows auditors to scrutinize the development process and design more targeted evaluations. In this paper, we examine the limitations of black-box audits and the advantages of white- and outside-the-box audits. We also discuss technical, physical, and legal safeguards for performing these audits with minimal security risks. Given that different forms of access can lead to very different levels of evaluation, we conclude that (1) transparency regarding the access and methods used by auditors is necessary to properly interpret audit results, and (2) white- and outside-the-box access allow for substantially more scrutiny than black-box access alone.