LLM Novice Uplift on Dual-Use, In Silico Biology Tasks

arXiv:2602.23329v1 | 1 citation | h-index: 7
Originality: Highly original
AI Analysis

This research asks whether LLMs enable novice users to perform complex biological tasks, a question that bears on both scientific acceleration and dual-use risk assessment for policymakers and researchers.

This study investigated whether large language models (LLMs) uplift novice users on dual-use, in silico biology tasks. Novices with LLM access were 4.16 times more accurate than controls limited to internet-only resources. On three of the four benchmarks with expert baselines, LLM-assisted novices outperformed the experts, though standalone LLMs often exceeded LLM-assisted novices.

Large language models (LLMs) perform increasingly well on biology benchmarks, but it remains unclear whether they uplift novice users -- i.e., enable humans to perform better than with internet-only resources. This uncertainty is central to understanding both scientific acceleration and dual-use risk. We conducted a multi-model, multi-benchmark human uplift study comparing novices with LLM access versus internet-only access across eight biosecurity-relevant task sets. Participants worked on complex problems with ample time (up to 13 hours for the most involved tasks). We found that LLM access provided substantial uplift: novices with LLMs were 4.16 times more accurate than controls (95% CI [2.63, 6.87]). On four benchmarks with available expert baselines (internet-only), novices with LLMs outperformed experts on three of them. Perhaps surprisingly, standalone LLMs often exceeded LLM-assisted novices, indicating that users were not eliciting the strongest available contributions from the LLMs. Most participants (89.6%) reported little difficulty obtaining dual-use-relevant information despite safeguards. Overall, LLMs substantially uplift novices on biological tasks previously reserved for trained practitioners, underscoring the need for sustained, interactive uplift evaluations alongside traditional benchmarks.

