LG CL

May 27, 2025

From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs

Stanley Yu, Vaidehi Bulusu, Oscar Yasunaga, Clayton Lau, Cole Blondin, Sean O'Brien, Kevin Zhu, Vasu Sharma

arXiv:2505.21800v1

11.43 citations

h-index: 26

Originality: Incremental advance

AI Analysis

This work addresses truthfulness in LLMs, a concern for AI safety and interpretability. It is an incremental advance: it applies an existing framework, concept cones (recently introduced for modeling refusal), to a new domain, truth.

The paper tackles the problem of LLMs generating falsehoods by extending the concept cone framework to represent truth. It identifies multi-dimensional cones that causally mediate truth-related behavior across multiple LLM families. The reported results show that causal interventions reliably flip model responses to factual statements and that the learned cones generalize across model architectures.

Large Language Models (LLMs) exhibit strong conversational abilities but often generate falsehoods. Prior work suggests that the truthfulness of simple propositions can be represented as a single linear direction in a model's internal activations, but this may not fully capture its underlying geometry. In this work, we extend the concept cone framework, recently introduced for modeling refusal, to the domain of truth. We identify multi-dimensional cones that causally mediate truth-related behavior across multiple LLM families. Our results are supported by three lines of evidence: (i) causal interventions reliably flip model responses to factual statements, (ii) learned cones generalize across model architectures, and (iii) cone-based interventions preserve unrelated model behavior. These findings reveal the richer, multidirectional structure governing simple true/false propositions in LLMs and highlight concept cones as a promising tool for probing abstract behaviors.
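To make the geometry in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a cone is the set of non-negative combinations of a few basis vectors, uses a difference-of-means vector as a stand-in for the single "truth direction" of prior work, and uses projection removal as a stand-in for a causal intervention. All names (`difference_of_means`, `sample_from_cone`, `ablate`) and the synthetic activations are hypothetical; real experiments would use hidden states extracted from an LLM.

```python
import numpy as np


def difference_of_means(true_acts, false_acts):
    """Unit vector pointing from the mean 'false' activation to the mean 'true' one."""
    d = true_acts.mean(axis=0) - false_acts.mean(axis=0)
    return d / np.linalg.norm(d)


def sample_from_cone(basis, rng):
    """Unit vector formed as a non-negative combination of the basis rows."""
    weights = rng.random(basis.shape[0])  # weights drawn from [0, 1)
    v = weights @ basis
    return v / np.linalg.norm(v)


def ablate(acts, direction):
    """Remove the component of each activation along a unit-norm direction."""
    return acts - np.outer(acts @ direction, direction)


rng = np.random.default_rng(0)
dim = 16

# Hypothetical cone basis: three orthonormal vectors (an assumption for illustration).
basis = np.eye(dim)[:3]

# Synthetic stand-ins for hidden states of true and false statements.
offset = 2.0 * basis[0] + 1.0 * basis[1]
true_acts = rng.normal(size=(200, dim)) + offset
false_acts = rng.normal(size=(200, dim)) - offset

# Single-direction view: one vector separating the two classes.
d = difference_of_means(true_acts, false_acts)

# Cone view: any non-negative mix of the basis vectors is a candidate direction.
v = sample_from_cone(basis, rng)

# Intervention: after ablation, activations have no component along v.
ablated = ablate(true_acts, v)
print("max residual projection:", np.abs(ablated @ v).max())
```

In this toy setup the single-direction view yields one vector, while the cone view yields a family of candidate directions, each of which can be ablated in the same way.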
