An Example Safety Case for Safeguards Against Misuse
This work addresses the challenge AI developers face in rigorously justifying that misuse risks are low, though it is incremental in that it builds on existing evaluation methods.
The paper tackles the problem of evaluating AI misuse safeguards by proposing an end-to-end safety case that demonstrates how safeguards can reduce AI assistant risks to low levels, using a quantitative uplift model to estimate dissuasion of misuse.
Existing evaluations of AI misuse safeguards provide a patchwork of evidence that is often difficult to connect to real-world decisions. To bridge this gap, we describe an end-to-end argument (a "safety case") that misuse safeguards reduce the risk posed by an AI assistant to low levels. We first describe how a hypothetical developer red teams safeguards, estimating the effort required to evade them. Then, the developer plugs this estimate into a quantitative "uplift model" to determine how much the barriers introduced by safeguards dissuade misuse (https://www.aimisusemodel.com/). This procedure provides a continuous signal of risk during deployment that helps the developer rapidly respond to emerging threats. Finally, we describe how to tie these components together into a simple safety case. Our work provides one concrete path -- though not the only path -- to rigorously justifying that AI misuse risks are low.
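As a rough illustration of the pipeline described above (a red-team effort estimate feeding a quantitative model whose output is compared against a risk threshold), the following toy sketch may help. It is not the uplift model from the paper or from the linked site. The exponential dissuasion curve, the function names, and every parameter (`tolerance_hours`, `attempts_per_year`, `harm_per_success`, `threshold`) are hypothetical choices made only for this example.

```python
import math


def persisting_fraction(evasion_hours: float, tolerance_hours: float) -> float:
    """Fraction of would-be misusers who keep trying despite the safeguards.

    ASSUMPTION: dissuasion decays exponentially with the effort needed to
    evade safeguards. This is an illustrative form, not the paper's model.
    """
    if evasion_hours < 0 or tolerance_hours <= 0:
        raise ValueError("evasion_hours must be >= 0 and tolerance_hours > 0")
    return math.exp(-evasion_hours / tolerance_hours)


def expected_annual_harm(
    attempts_per_year: float,
    evasion_hours: float,
    tolerance_hours: float,
    harm_per_success: float,
) -> float:
    """Toy risk estimate: attempts that persist, times harm per success."""
    persisting = persisting_fraction(evasion_hours, tolerance_hours)
    return attempts_per_year * persisting * harm_per_success


def risk_is_low(
    attempts_per_year: float,
    evasion_hours: float,
    tolerance_hours: float,
    harm_per_success: float,
    threshold: float,
) -> bool:
    """Compare the toy estimate against a hypothetical acceptable-risk threshold."""
    harm = expected_annual_harm(
        attempts_per_year, evasion_hours, tolerance_hours, harm_per_success
    )
    return harm < threshold


if __name__ == "__main__":
    # Hypothetical inputs: red teaming estimates 50 hours of effort to evade.
    print(risk_is_low(100, 50, 10, 2.0, threshold=5.0))
    # If a new jailbreak cuts the evasion effort to zero, the check is re-run.
    print(risk_is_low(100, 0, 10, 2.0, threshold=5.0))
```

Re-running the check whenever the red-team effort estimate changes is one simple way to picture the "continuous signal of risk during deployment" mentioned above. A real safety case would need an evidence-backed model in place of the assumed curve.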