Dillon Bowen

CR
h-index16
7papers
419citations
Novelty45%
AI Score51

7 Papers

CRAug 6, 2024
Scaling Trends for Data Poisoning in LLMs

Dillon Bowen, Brendan Murphy, Will Cai et al.

LLMs produce harmful and undesirable behavior when trained on datasets containing even a small fraction of poisoned data. We demonstrate that GPT models remain vulnerable to fine-tuning on poisoned data, even when safeguarded by moderation systems. Given the persistence of data poisoning vulnerabilities in today's most capable models, this paper investigates whether these risks increase with model scaling. We evaluate three threat models -- malicious fine-tuning, imperfect data curation, and intentional data contamination -- across 24 frontier LLMs ranging from 1.5 to 72 billion parameters. Our experiments reveal that larger LLMs are significantly more susceptible to data poisoning, learning harmful behaviors from even minimal exposure to harmful data more quickly than smaller models. These findings underscore the need for leading AI companies to thoroughly red team fine-tuning APIs before public release and to develop more robust safeguards against data poisoning, particularly as models continue to scale in size and capability.

AIDec 8, 2025Code
Auditing Games for Sandbagging

Jordan Taylor, Sid Black, Dillon Bowen et al.

Future AI systems could conceal their capabilities ('sandbagging') during evaluations, potentially misleading developers and auditors. We stress-tested sandbagging detection techniques using an auditing game. First, a red team fine-tuned five models, some of which conditionally underperformed, as a proxy for sandbagging. Second, a blue team used black-box, model-internals, or training-based approaches to identify sandbagging models. We found that the blue team could not reliably discriminate sandbaggers from benign models. Black-box approaches were defeated by effective imitation of a weaker model. Linear probes, a model-internals approach, showed more promise but their naive application was vulnerable to behaviours instilled by the red team. We also explored capability elicitation as a strategy for detecting sandbagging. Although Prompt-based elicitation was not reliable, training-based elicitation consistently elicited full performance from the sandbagging models, using only a single correct demonstration of the evaluation task. However the performance of benign models was sometimes also raised, so relying on elicitation as a detection strategy was prone to false-positives. In the short-term, we recommend developers remove potential sandbagging using on-distribution training for elicitation. In the longer-term, further research is needed to ensure the efficacy of training-based elicitation, and develop robust methods for sandbagging detection. We open source our model organisms at https://github.com/AI-Safety-Institute/sandbagging_auditing_games and select transcripts and results at https://huggingface.co/datasets/sandbagging-games/evaluation_logs . A demo illustrating the game can be played at https://sandbagging-demo.far.ai/ .

LGFeb 15, 2024
A StrongREJECT for Empty Jailbreaks

Alexandra Souly, Qingyuan Lu, Dillon Bowen et al.

Most jailbreak papers claim the jailbreaks they propose are highly effective, often boasting near-100% attack success rates. However, it is perhaps more common than not for jailbreak developers to substantially exaggerate the effectiveness of their jailbreaks. We suggest this problem arises because jailbreak researchers lack a standard, high-quality benchmark for evaluating jailbreak performance, leaving researchers to create their own. To create a benchmark, researchers must choose a dataset of forbidden prompts to which a victim model will respond, along with an evaluation method that scores the harmfulness of the victim model's responses. We show that existing benchmarks suffer from significant shortcomings and introduce the StrongREJECT benchmark to address these issues. StrongREJECT's dataset contains prompts that victim models must answer with specific, harmful information, while its automated evaluator measures the extent to which a response gives useful information to forbidden prompts. In doing so, the StrongREJECT evaluator achieves state-of-the-art agreement with human judgments of jailbreak effectiveness. Notably, we find that existing evaluation methods significantly overstate jailbreak effectiveness compared to human judgments and the StrongREJECT evaluator. We describe a surprising and novel phenomenon that explains this discrepancy: jailbreaks bypassing a victim model's safety fine-tuning tend to reduce its capabilities. Together, our findings underscore the need for researchers to use a high-quality benchmark, such as StrongREJECT, when developing new jailbreak attacks. We release the StrongREJECT code and data at https://strong-reject.readthedocs.io/en/latest/.

CYJul 8, 2025Code
The Safety Gap Toolkit: Evaluating Hidden Dangers of Open-Source Models

Ann-Kathrin Dombrowski, Dillon Bowen, Adam Gleave et al.

Open-weight large language models (LLMs) unlock huge benefits in innovation, personalization, privacy, and democratization. However, their core advantage - modifiability - opens the door to systemic risks: bad actors can trivially subvert current safeguards, turning beneficial models into tools for harm. This leads to a 'safety gap': the difference in dangerous capabilities between a model with intact safeguards and one that has been stripped of those safeguards. We open-source a toolkit to estimate the safety gap for state-of-the-art open-weight models. As a case study, we evaluate biochemical and cyber capabilities, refusal rates, and generation quality of models from two families (Llama-3 and Qwen-2.5) across a range of parameter scales (0.5B to 405B) using different safeguard removal techniques. Our experiments reveal that the safety gap widens as model scale increases and effective dangerous capabilities grow substantially when safeguards are removed. We hope that the Safety Gap Toolkit (https://github.com/AlignmentResearch/safety-gap) will serve as an evaluation framework for common open-source models and as a motivation for developing and testing tamper-resistant safeguards. We welcome contributions to the toolkit from the community.

CRJul 15, 2025
Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility

Brendan Murphy, Dillon Bowen, Shahrad Mohammadzadeh et al.

AI systems are rapidly advancing in capability, and frontier model developers broadly acknowledge the need for safeguards against serious misuse. However, this paper demonstrates that fine-tuning, whether via open weights or closed fine-tuning APIs, can produce helpful-only models with safeguards destroyed. In contrast to prior work which is blocked by modern moderation systems or achieved only partial removal of safeguards or degraded output quality, our jailbreak-tuning method teaches models to generate detailed, high-quality responses to arbitrary harmful requests. For example, OpenAI, Google, and Anthropic models will fully comply with requests for CBRN assistance, executing cyberattacks, and other criminal activity. We further show that backdoors can increase not only the stealth but also the severity of attacks. Stronger jailbreak prompts become even more effective in fine-tuning attacks, linking attacks and potentially defenses in the input and weight spaces. Not only are current models vulnerable, more recent ones also appear to be becoming even more vulnerable to these attacks, underscoring the urgent need for tamper-resistant safeguards. Until such safeguards are discovered, companies and policymakers should view the release of any fine-tunable model as simultaneously releasing its evil twin: equally capable as the original model, and usable for any malicious purpose within its capabilities.

CYMar 17, 2025
AI Companies Should Report Pre- and Post-Mitigation Safety Evaluations

Dillon Bowen, Ann-Kathrin Dombrowski, Adam Gleave et al.

The rapid advancement of AI systems has raised widespread concerns about potential harms of frontier AI systems and the need for responsible evaluation and oversight. In this position paper, we argue that frontier AI companies should report both pre- and post-mitigation safety evaluations to enable informed policy decisions. Evaluating models at both stages provides policymakers with essential evidence to regulate deployment, access, and safety standards. We show that relying on either in isolation can create a misleading picture of model safety. Our analysis of AI safety disclosures from leading frontier labs identifies three critical gaps: (1) companies rarely evaluate both pre- and post-mitigation versions, (2) evaluation methods lack standardization, and (3) reported results are often too vague to inform policy. To address these issues, we recommend mandatory disclosure of pre- and post-mitigation capabilities to approved government bodies, standardized evaluation methods, and minimum transparency requirements for public safety reporting. These ensure that policymakers and regulators can craft targeted safety measures, assess deployment risks, and scrutinize companies' safety claims effectively.

LGJun 12, 2020
Generalized SHAP: Generating multiple types of explanations in machine learning

Dillon Bowen, Lyle Ungar

Many important questions about a model cannot be answered just by explaining how much each feature contributes to its output. To answer a broader set of questions, we generalize a popular, mathematically well-grounded explanation technique, Shapley Additive Explanations (SHAP). Our new method - Generalized Shapley Additive Explanations (G-SHAP) - produces many additional types of explanations, including: 1) General classification explanations; Why is this sample more likely to belong to one class rather than another? 2) Intergroup differences; Why do our model's predictions differ between groups of observations? 3) Model failure; Why does our model perform poorly on a given sample? We formally define these types of explanations and illustrate their practical use on real data.