Lying Blindly: Bypassing ChatGPT's Safeguards to Generate Hard-to-Detect Disinformation Claims
The findings highlight a critical vulnerability in AI safety for combating disinformation: current safeguards and detection methods are insufficient, which leaves the model open to misuse in coordinated campaigns.
The study investigated ChatGPT's ability to generate short-form disinformation claims about the war in Ukraine, including claims about events beyond its knowledge cutoff. A straightforward prompting technique bypassed the model's safeguards, and the resulting claims were realistic: neither human readers nor automated tools could reliably distinguish them from human-authored false claims.
As Large Language Models become more proficient, their misuse in coordinated disinformation campaigns is a growing concern. This study explores the capability of ChatGPT with GPT-3.5 to generate short-form disinformation claims about the war in Ukraine, both in general and on a specific event that is beyond the GPT-3.5 knowledge cutoff. Unlike prior work, we do not include human-written disinformation narratives in the prompt. The generated short claims are therefore hallucinations based on prior world knowledge and inference from the minimal prompt. With a straightforward prompting technique, we are able to bypass model safeguards and generate numerous short claims. We compare these against human-authored false claims on the war in Ukraine from ClaimReview, specifically with respect to differences in their linguistic properties. We also evaluate whether AI authorship can be detected by human readers or state-of-the-art authorship detection tools. We demonstrate that ChatGPT can produce realistic, target-specific disinformation claims, even on a specific post-cutoff event, and that neither humans nor existing automated tools can reliably distinguish them from human-authored claims.