CL · Sep 25, 2025

Sycophancy Is Not One Thing: Causal Separation of Sycophantic Behaviors in LLMs

arXiv:2509.21305v2 · 14 citations · h-index: 1
Originality: Incremental advance
AI Analysis

This work addresses sycophancy in LLMs, a concern for AI safety and interpretability. Its contribution, a causal separation of sycophantic behaviors, is an incremental refinement of existing behavioral analysis.

The paper asks whether sycophantic behaviors in large language models share a single mechanism. It decomposes sycophancy into sycophantic agreement and sycophantic praise, contrasts both with genuine agreement, and finds that the three behaviors are encoded along distinct linear directions in latent space. Each behavior can be amplified or suppressed independently of the others, and this structure is consistent across model families and scales.

Large language models (LLMs) often exhibit sycophantic behaviors -- such as excessive agreement with or flattery of the user -- but it is unclear whether these behaviors arise from a single mechanism or multiple distinct processes. We decompose sycophancy into sycophantic agreement and sycophantic praise, contrasting both with genuine agreement. Using difference-in-means directions, activation additions, and subspace geometry across multiple models and datasets, we show that: (1) the three behaviors are encoded along distinct linear directions in latent space; (2) each behavior can be independently amplified or suppressed without affecting the others; and (3) their representational structure is consistent across model families and scales. These results suggest that sycophantic behaviors correspond to distinct, independently steerable representations.
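To make the terms in the abstract concrete, the following is a minimal sketch of a difference-in-means direction and an activation addition on synthetic data. It is not the authors' code: the toy activations, the planted axes, the dimension, the steering strength `alpha`, and every function name are assumptions chosen for illustration. In practice the vectors would be hidden activations collected from a model on contrastive prompts.

```python
import math
import random

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def diff_in_means(pos, neg):
    """Unit vector pointing from the mean of `neg` to the mean of `pos`."""
    mu_pos, mu_neg = mean_vector(pos), mean_vector(neg)
    return unit([p - q for p, q in zip(mu_pos, mu_neg)])

def steer(h, direction, alpha):
    """Activation addition: shift activation h by alpha along direction."""
    return [a + alpha * d for a, d in zip(h, direction)]

def toy_activations(rng, n, dim, axis, shift):
    """Hypothetical stand-in for model activations: Gaussian noise
    plus a planted offset along one coordinate axis."""
    rows = []
    for _ in range(n):
        row = [rng.gauss(0.0, 0.1) for _ in range(dim)]
        row[axis] += shift
        rows.append(row)
    return rows

rng = random.Random(0)
DIM = 8  # assumed toy dimension

# Assumption: each behavior is planted on its own axis.
agree_pos = toy_activations(rng, 200, DIM, axis=0, shift=1.0)
agree_neg = toy_activations(rng, 200, DIM, axis=0, shift=0.0)
praise_pos = toy_activations(rng, 200, DIM, axis=1, shift=1.0)
praise_neg = toy_activations(rng, 200, DIM, axis=1, shift=0.0)

d_agree = diff_in_means(agree_pos, agree_neg)
d_praise = diff_in_means(praise_pos, praise_neg)

# Cosine similarity between the two unit directions.
overlap = dot(d_agree, d_praise)

# Steer one activation along the agreement direction only.
h = agree_neg[0]
h_steered = steer(h, d_agree, alpha=2.0)

# Change in the projection onto each direction after steering.
shift_agree = dot(h_steered, d_agree) - dot(h, d_agree)
shift_praise = dot(h_steered, d_praise) - dot(h, d_praise)

print(f"overlap={overlap:.3f} "
      f"shift_agree={shift_agree:.3f} shift_praise={shift_praise:.3f}")
```

Because the two toy directions are nearly orthogonal by construction, steering along one leaves the projection onto the other almost unchanged. The paper reports an analogous independence for directions measured in real models, whereas here it is built into the synthetic data.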

