When Agents Persuade: Propaganda Generation and Mitigation in LLMs
This research addresses the critical problem of LLM-based agents generating manipulative material, which is a concern for anyone deploying or interacting with LLMs in open environments.
This study investigates whether LLMs generate manipulative content when tasked with propaganda objectives. The authors find that LLMs exhibit propagandistic behaviors and employ various rhetorical techniques, and that fine-tuning, particularly with ORPO, significantly reduces this tendency.
Despite their wide-ranging benefits, LLM-based agents deployed in open environments can be exploited to produce manipulative material. In this study, we task LLMs with propaganda objectives and analyze their outputs using two domain-specific models: one that classifies text as propaganda or non-propaganda, and another that detects rhetorical techniques of propaganda (e.g., loaded language, appeals to fear, flag-waving, name-calling). Our findings show that, when prompted, LLMs exhibit propagandistic behaviors and use a variety of rhetorical techniques in doing so. We also explore mitigation via Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Odds Ratio Preference Optimization (ORPO). We find that fine-tuning significantly reduces the models' tendency to generate such content, with ORPO proving most effective.
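To make the ORPO objective mentioned above concrete, here is a minimal sketch of its per-example loss in plain Python. It follows the commonly published ORPO formulation (a supervised loss on the preferred response plus a weighted odds-ratio penalty) and is not taken from this study's training code. The function names, the weight `lam`, and the mapping of "chosen" to a non-propagandistic response and "rejected" to a propagandistic one are assumptions made for illustration.

```python
import math


def log_odds(avg_logprob: float) -> float:
    """Return log(p / (1 - p)) for p = exp(avg_logprob).

    avg_logprob is the length-normalized (per-token average) log-likelihood
    of a response, so it must be strictly negative (0 < p < 1).
    """
    p = math.exp(avg_logprob)
    return avg_logprob - math.log1p(-p)


def orpo_loss(avg_logprob_chosen: float,
              avg_logprob_rejected: float,
              lam: float = 0.1) -> float:
    """Per-example ORPO loss: SFT term + lam * odds-ratio term.

    Assumption for this sketch: 'chosen' is the preferred (non-propagandistic)
    response and 'rejected' is the dispreferred (propagandistic) one.
    lam = 0.1 is an illustrative value, not one reported by the source.
    """
    # SFT term: negative log-likelihood of the chosen response.
    sft_term = -avg_logprob_chosen

    # Log of the odds ratio between chosen and rejected responses.
    log_odds_ratio = log_odds(avg_logprob_chosen) - log_odds(avg_logprob_rejected)

    # -log(sigmoid(x)) written as log(1 + exp(-x)); this simple form can
    # overflow for very negative x, which a production version would guard.
    odds_ratio_term = math.log1p(math.exp(-log_odds_ratio))

    return sft_term + lam * odds_ratio_term


if __name__ == "__main__":
    # The penalty shrinks as the rejected response becomes less likely.
    print(orpo_loss(-0.5, -0.5))
    print(orpo_loss(-0.5, -2.0))
```

Unlike DPO, this objective needs no separate reference model: the preference signal comes entirely from the odds-ratio term, which is added to the ordinary supervised loss in a single training stage.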