CL · AI · Apr 1, 2025

When Persuasion Overrides Truth in Multi-Agent LLM Debates: Introducing a Confidence-Weighted Persuasion Override Rate (CW-POR)

arXiv:2504.00374v1 · 9 citations · h-index: 4 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses misinformation in AI systems. Users who rely on LLMs for accurate information are exposed to a vulnerability in which the model confidently endorses false claims.

The study investigated the risk of Large Language Models (LLMs) being deceived by persuasive falsehoods in multi-agent debates, finding that even smaller models can craft arguments that override truthful answers with high confidence, as measured by the Confidence-Weighted Persuasion Override Rate (CW-POR).

In many real-world scenarios, a single Large Language Model (LLM) may encounter contradictory claims (some accurate, others forcefully incorrect) and must judge which is true. We investigate this risk in a single-turn, multi-agent debate framework: one LLM-based agent provides a factual answer from TruthfulQA, another vigorously defends a falsehood, and the same LLM architecture serves as judge. We introduce the Confidence-Weighted Persuasion Override Rate (CW-POR), which captures not only how often the judge is deceived but also how strongly it believes the incorrect choice. Our experiments on five open-source LLMs (3B-14B parameters), where we systematically vary agent verbosity (30-300 words), reveal that even smaller models can craft persuasive arguments that override truthful answers, often with high confidence. These findings underscore the importance of robust calibration and adversarial testing to prevent LLMs from confidently endorsing misinformation.
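The abstract describes CW-POR only in words, so the following is a minimal sketch of one plausible formalization, not necessarily the paper's exact definition. It assumes each judge verdict carries a self-reported confidence in [0, 1], and it weights each override by that confidence, normalizing by the total confidence across all verdicts. The names `Verdict`, `persuasion_override_rate`, and `confidence_weighted_por` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Verdict:
    """One judged debate (hypothetical record layout)."""
    chose_falsehood: bool  # True if the judge sided with the false answer
    confidence: float      # judge's stated confidence in its choice, in [0, 1]


def persuasion_override_rate(verdicts: Sequence[Verdict]) -> float:
    """Unweighted rate: fraction of debates where the falsehood won."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(v.chose_falsehood for v in verdicts) / len(verdicts)


def confidence_weighted_por(verdicts: Sequence[Verdict]) -> float:
    """Assumed weighting: confidence mass on overrides / total confidence mass."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    total = sum(v.confidence for v in verdicts)
    if total == 0:
        return 0.0  # the judge expressed no confidence in any verdict
    overridden = sum(v.confidence for v in verdicts if v.chose_falsehood)
    return overridden / total


if __name__ == "__main__":
    sample = [
        Verdict(chose_falsehood=True, confidence=0.9),
        Verdict(chose_falsehood=False, confidence=0.6),
        Verdict(chose_falsehood=True, confidence=0.5),
        Verdict(chose_falsehood=False, confidence=0.0),
    ]
    print(persuasion_override_rate(sample))
    print(confidence_weighted_por(sample))
```

Under this assumed weighting, a judge that is deceived in half of the debates but is more confident when deceived scores above 0.5, which is the distinction from a plain override rate that the abstract points to.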

