CL · Oct 8, 2025

How Language Models Conflate Logical Validity with Plausibility: A Representational Analysis of Content Effects

arXiv:2510.06700v1 · h-index: 9
Originality: Highly original
AI Analysis

This paper addresses content effects, a cognitive bias that undermines the reliability of LLMs in logical reasoning applications, and offers a representational method to mitigate it.

The study investigates how large language models (LLMs) conflate logical validity with plausibility in reasoning tasks. It shows that the two concepts are linearly represented and closely aligned in the models' internal representations, and it constructs debiasing vectors that reduce content effects and improve reasoning accuracy.

Both humans and large language models (LLMs) exhibit content effects: biases in which the plausibility of the semantic content of a reasoning problem influences judgments regarding its logical validity. While this phenomenon in humans is best explained by the dual-process theory of reasoning, the mechanisms behind content effects in LLMs remain unclear. In this work, we address this issue by investigating how LLMs encode the concepts of validity and plausibility within their internal representations. We show that both concepts are linearly represented and strongly aligned in representational geometry, leading models to conflate plausibility with validity. Using steering vectors, we demonstrate that plausibility vectors can causally bias validity judgements, and vice versa, and that the degree of alignment between these two concepts predicts the magnitude of behavioral content effects across models. Finally, we construct debiasing vectors that disentangle these concepts, reducing content effects and improving reasoning accuracy. Our findings advance understanding of how abstract logical concepts are represented in LLMs and highlight representational interventions as a path toward more logical systems.
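The geometric idea in the abstract can be illustrated with a toy sketch. This is not the authors' code or method. It assumes synthetic 8-dimensional "activations" in place of real LLM hidden states, uses a difference-of-means estimate as a stand-in for a concept vector, and uses orthogonal projection as a stand-in for a debiasing vector. All names (`sample`, `diff_of_means`, `remove_component`, the axis layout, the leak factor of 0.8) are hypothetical.

```python
import math
import random

def mean_vector(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def diff_of_means(pos, neg):
    """Concept direction estimated as mean(pos) - mean(neg)."""
    return [a - b for a, b in zip(mean_vector(pos), mean_vector(neg))]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

def remove_component(v, direction):
    """Project `direction` out of `v` (orthogonal projection)."""
    scale = dot(v, direction) / dot(direction, direction)
    return [a - scale * b for a, b in zip(v, direction)]

def sample(center, n=200):
    """Synthetic activations: Gaussian noise around `center` (toy data only)."""
    return [[c + random.gauss(0.0, 1.0) for c in center] for _ in range(n)]

random.seed(0)
DIM = 8
# Assumption: axis 0 carries plausibility and axis 1 carries validity.
# Validity examples also leak onto axis 0, which models the conflation.
valid_center = [2.4, 3.0] + [0.0] * (DIM - 2)
plausible_center = [3.0, 0.0] + [0.0] * (DIM - 2)

valid = sample(valid_center)
invalid = sample([-c for c in valid_center])
plausible = sample(plausible_center)
implausible = sample([-c for c in plausible_center])

validity_vec = diff_of_means(valid, invalid)
plausibility_vec = diff_of_means(plausible, implausible)

# Alignment between the two concept directions.
alignment = cosine(validity_vec, plausibility_vec)

# Steering: push an "invalid-looking" state along the plausibility vector
# and measure how its score on the validity direction changes.
alpha = 0.5
h = mean_vector(invalid)
steered = [a + alpha * b for a, b in zip(h, plausibility_vec)]
shift = dot(steered, validity_vec) - dot(h, validity_vec)

# Debiasing: strip the plausibility component from the validity direction,
# then repeat the measurement with the debiased direction.
debiased = remove_component(validity_vec, plausibility_vec)
shift_debiased = dot(steered, debiased) - dot(h, debiased)

print(f"alignment (cosine): {alignment:.3f}")
print(f"validity-score shift from plausibility steering: {shift:.3f}")
print(f"same shift measured on debiased direction: {shift_debiased:.2e}")
```

In this toy setup, steering along the plausibility direction moves the validity score because the two directions overlap, and the shift vanishes once the plausibility component is projected out. Real experiments would extract activations from a model and evaluate behavioral accuracy, which this sketch does not attempt.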

