CRAI · Apr 21, 2025

RepliBench: Evaluating the Autonomous Replication Capabilities of Language Model Agents

arXiv:2504.18565v2 · 12 citations · h-index: 11
AI Analysis

The paper addresses the safety risk of uncontrollable autonomous replication, a concern for AI developers and policymakers. Its contribution is incremental: it benchmarks existing models without proposing new methods.

The paper introduces RepliBench, a suite of evaluations that measures the autonomous replication capabilities of language model agents. It finds that current frontier models do not pose a credible threat of self-replication, but they succeed on many component tasks and are improving rapidly: the best model achieves a >50% pass@10 score on 15 of 20 task families.

Uncontrollable autonomous replication of language model agents poses a critical safety risk. To better understand this risk, we introduce RepliBench, a suite of evaluations designed to measure autonomous replication capabilities. RepliBench is derived from a decomposition of these capabilities covering four core domains: obtaining resources, exfiltrating model weights, replicating onto compute, and persisting on this compute for long periods. We create 20 novel task families consisting of 86 individual tasks. We benchmark 5 frontier models, and find they do not currently pose a credible threat of self-replication, but succeed on many components and are improving rapidly. Models can deploy instances from cloud compute providers, write self-propagating programs, and exfiltrate model weights under simple security setups, but struggle to pass KYC checks or set up robust and persistent agent deployments. Overall the best model we evaluated (Claude 3.7 Sonnet) has a >50% pass@10 score on 15/20 task families, and a >50% pass@10 score for 9/20 families on the hardest variants. These findings suggest autonomous replication capability could soon emerge with improvements in these remaining areas or with human assistance.
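The headline numbers above are pass@10 scores thresholded at 50% per task family. As a rough illustration of how such a count could be computed, here is a minimal sketch. It assumes the widely used unbiased pass@k estimator and assumes a family's score is the mean of its tasks' scores; the abstract does not state the exact estimator or aggregation rule, and the function names and sample data are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k attempts succeeds,
    given c observed successes out of n attempts (requires k <= n)."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k attempts include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def families_above(results, k=10, threshold=0.5):
    """Return the task families whose mean pass@k exceeds the threshold.

    results maps a family name to a list of (n, c) pairs, one per task.
    Averaging over tasks is an assumption made for this sketch.
    """
    passing = []
    for family, tasks in results.items():
        score = sum(pass_at_k(n, c, k) for n, c in tasks) / len(tasks)
        if score > threshold:
            passing.append(family)
    return passing


# Hypothetical data: two families, each with two tasks run 10 times.
example = {
    "family_a": [(10, 1), (10, 2)],
    "family_b": [(10, 0), (10, 0)],
}
passing_families = families_above(example)
```

With exactly 10 attempts per task, pass@10 reduces to "did any attempt succeed", so a family clears the 50% bar under this sketch only when more than half of its tasks were solved at least once.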
