AIJan 8
Large language models can effectively convince people to believe conspiraciesThomas H. Costello, Kellin Pelrine, Matthew Kowal et al.
Large language models (LLMs) have been shown to be persuasive across a variety of contexts. But it remains unclear whether this persuasive power advantages truth over falsehood, or if LLMs can promote misbeliefs just as easily as refuting them. Here, we investigate this question across three pre-registered experiments in which participants (N = 2,724 Americans) discussed a conspiracy theory they were uncertain about with GPT-4o, and the model was instructed to either argue against ("debunking") or for ("bunking") that conspiracy. When using a "jailbroken" GPT-4o variant with guardrails removed, the AI was as effective at increasing conspiracy belief as decreasing it. Concerningly, the bunking AI was rated more positively, and increased trust in AI, more than the debunking AI. Surprisingly, we found that using standard GPT-4o produced very similar effects, such that the guardrails imposed by OpenAI did little to prevent the LLM from promoting conspiracy beliefs. Encouragingly, however, a corrective conversation reversed these newly induced conspiracy beliefs, and simply prompting GPT-4o to only use accurate information dramatically reduced its ability to increase conspiracy beliefs. Our findings demonstrate that LLMs possess potent abilities to promote both truth and falsehood, but that potential solutions may exist to help mitigate this risk.
AIJun 3, 2025Code
It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful TopicsMatthew Kowal, Jasper Timm, Jean-Francois Godbout et al.
Persuasion is a powerful capability of large language models (LLMs) that both enables beneficial applications (e.g. helping people quit smoking) and raises significant risks (e.g. large-scale, targeted political manipulation). Prior work has found models possess a significant and growing persuasive capability, measured by belief changes in simulated or real users. However, these benchmarks overlook a crucial risk factor: the propensity of a model to attempt to persuade in harmful contexts. Understanding whether a model will blindly ``follow orders'' to persuade on harmful topics (e.g. glorifying joining a terrorist group) is key to understanding the efficacy of safety guardrails. Moreover, understanding if and when a model will engage in persuasive behavior in pursuit of some goal is essential to understanding the risks from agentic AI systems. We propose the Attempt to Persuade Eval (APE) benchmark, that shifts the focus from persuasion success to persuasion attempts, operationalized as a model's willingness to generate content aimed at shaping beliefs or behavior. Our evaluation framework probes frontier LLMs using a multi-turn conversational setup between simulated persuader and persuadee agents. APE explores a diverse spectrum of topics including conspiracies, controversial issues, and non-controversially harmful content. We introduce an automated evaluator model to identify willingness to persuade and measure the frequency and context of persuasive attempts. We find that many open and closed-weight models are frequently willing to attempt persuasion on harmful topics and that jailbreaking can increase willingness to engage in such behavior. Our results highlight gaps in current safety guardrails and underscore the importance of evaluating willingness to persuade as a key dimension of LLM risk. APE is available at github.com/AlignmentResearch/AttemptPersuadeEval
HCDec 7, 2021
Do explanations increase the effectiveness of AI-crowd generated fake news warnings?Ziv Epstein, Nicolò Foppiani, Sophie Hilgard et al.
Social media platforms are increasingly deploying complex interventions to help users detect false news. Labeling false news using techniques that combine crowd-sourcing with artificial intelligence (AI) offers a promising way to inform users about potentially low-quality information without censoring content, but also can be hard for users to understand. In this study, we examine how users respond in their sharing intentions to information they are provided about a hypothetical human-AI hybrid system. We ask i) if these warnings increase discernment in social media sharing intentions and ii) if explaining how the labeling system works can boost the effectiveness of the warnings. To do so, we conduct a study ($N=1473$ Americans) in which participants indicated their likelihood of sharing content. Participants were randomly assigned to a control, a treatment where false content was labeled, or a treatment where the warning labels came with an explanation of how they were generated. We find clear evidence that both treatments increase sharing discernment, and directional evidence that explanations increase the warnings' effectiveness. Interestingly, we do not find that the explanations increase self-reported trust in the warning labels, although we do find some evidence that participants found the warnings with the explanations to be more informative. Together, these results have important implications for designing and deploying transparent misinformation warning labels, and AI-mediated systems more broadly.