AI CR SE

Jan 28, 2025

A sketch of an AI control safety case

Tomek Korbak, Joshua Clymer, Benjamin Hilton, Buck Shlegeris, Geoffrey Irving

arXiv:2501.17315v1 26.232 citations h-index: 16

Originality: Incremental advance

AI Analysis

This work addresses safety concerns for AI developers deploying LLM agents, but the advance is incremental: it sketches a preliminary framework rather than a full implementation.

The paper addresses how to ensure safety as LLM agents become more capable of causing harm. It proposes a 'control safety case', a structured argument that models cannot subvert control measures to cause unacceptable outcomes, and illustrates it with a case study on preventing data exfiltration, supported by evidence from red team evaluations.

As LLM agents gain a greater capacity to cause harm, AI developers might increasingly rely on control measures such as monitoring to justify that they are safe. We sketch how developers could construct a "control safety case", which is a structured argument that models are incapable of subverting control measures in order to cause unacceptable outcomes. As a case study, we sketch an argument that a hypothetical LLM agent deployed internally at an AI company won't exfiltrate sensitive information. The sketch relies on evidence from a "control evaluation," where a red team deliberately designs models to exfiltrate data in a proxy for the deployment environment. The safety case then hinges on several claims: (1) the red team adequately elicits model capabilities to exfiltrate data, (2) control measures remain at least as effective in deployment, and (3) developers conservatively extrapolate model performance to predict the probability of data exfiltration in deployment. This safety case sketch is a step toward more concrete arguments that can be used to show that a dangerously capable LLM agent is safe to deploy.
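To make claim (3) concrete, here is a minimal Python sketch of one way a conservative extrapolation could work. It is an illustration, not the method used in the paper. The function names, the Hoeffding-style confidence bound, and the assumption that deployment attack opportunities are independent are all hypothetical choices made for this example.

```python
import math


def upper_bound_success_rate(successes: int, attempts: int, delta: float = 0.05) -> float:
    """One-sided Hoeffding upper bound on the per-attempt attack success rate.

    Hypothetical helper: with probability at least 1 - delta, the true success
    rate is no larger than the returned value, assuming attempts are i.i.d.
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    if not 0 <= successes <= attempts:
        raise ValueError("successes must be between 0 and attempts")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must be in (0, 1)")
    observed = successes / attempts
    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * attempts))
    return min(1.0, observed + slack)


def deployment_risk_bound(p_upper: float, opportunities: int) -> float:
    """Bound on the probability of at least one successful exfiltration.

    Assumes (for illustration only) that each deployment opportunity succeeds
    independently with probability at most p_upper.
    """
    if not 0.0 <= p_upper <= 1.0:
        raise ValueError("p_upper must be in [0, 1]")
    if opportunities < 0:
        raise ValueError("opportunities must be non-negative")
    return 1.0 - (1.0 - p_upper) ** opportunities


if __name__ == "__main__":
    # Hypothetical numbers: 0 undetected exfiltrations in 1000 red-team attempts.
    p = upper_bound_success_rate(successes=0, attempts=1000)
    print(f"per-attempt upper bound: {p:.4f}")
    print(f"risk over 10 opportunities: {deployment_risk_bound(p, 10):.4f}")
```

The bound is deliberately pessimistic: even with zero observed successes, the per-attempt rate is not treated as zero, and the risk grows with the number of deployment opportunities. A real safety case would also need to justify the independence assumption and the fidelity of the proxy environment, which this sketch takes for granted.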
