CL · CR · CY · Apr 23, 2025

Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control

arXiv:2504.17130v3 · 13 citations · h-index: 5 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the issue of transparency and control over AI censorship for developers and users, though it is incremental as it builds on existing representation engineering techniques.

The researchers tackled the problem of understanding and controlling censorship mechanisms in safety-tuned large language models by identifying representation vectors that detect and manipulate refusal-compliance behavior and thought suppression, enabling removal of censorship through vector adjustments.

Large language models (LLMs) have transformed the way we access information. These models are often tuned to refuse to comply with requests that are considered harmful and to produce responses that better align with the preferences of those who control the models. To understand how this "censorship" works, we use representation engineering techniques to study open-weights safety-tuned models. We present a method for finding a refusal-compliance vector that detects and controls the level of censorship in model outputs. We also analyze recent reasoning LLMs, distilled from DeepSeek-R1, and uncover an additional dimension of censorship through "thought suppression". We show that a similar approach can be used to find a vector that suppresses the model's reasoning process, allowing us to remove censorship by applying negative multiples of this vector. Our code is publicly available at: https://github.com/hannahxchen/llm-censorship-steering
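As a rough illustration of the general idea behind a steering vector that both "detects" and "controls" a behavior, the sketch below uses a difference-of-means direction on synthetic activations. This is an assumption made for illustration only: the abstract does not specify how the authors compute their vector, and the function names, dimensions, and data here are hypothetical and not taken from the linked repository.

```python
import numpy as np


def difference_of_means_vector(refusal_acts, compliance_acts):
    """Unit vector pointing from the mean compliance activation to the mean refusal activation.

    Assumption: a difference-of-means direction stands in for the paper's vector.
    """
    direction = refusal_acts.mean(axis=0) - compliance_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)


def projection_score(hidden, unit_vector):
    """Scalar projection onto the vector; a larger value means more refusal-like."""
    return hidden @ unit_vector


def steer(hidden, unit_vector, coeff):
    """Add coeff * vector to a hidden state; a negative coeff pushes away from refusal."""
    return hidden + coeff * unit_vector


# Synthetic stand-ins for hidden states collected at one layer (hypothetical data).
rng = np.random.default_rng(0)
dim = 16
true_direction = np.zeros(dim)
true_direction[0] = 1.0
compliance_acts = rng.normal(size=(200, dim))
refusal_acts = rng.normal(size=(200, dim)) + 4.0 * true_direction

vector = difference_of_means_vector(refusal_acts, compliance_acts)

# "Detect": project a hidden state onto the vector.
hidden = refusal_acts[0]
before = projection_score(hidden, vector)

# "Control": apply a negative multiple of the vector, then re-measure.
after = projection_score(steer(hidden, vector, coeff=-4.0), vector)
print(f"projection before: {before:.2f}, after: {after:.2f}")
```

Because the vector has unit length, steering with coefficient c shifts the projection by exactly c, which is why a single scalar can act as a dial in this toy setting. In a real model the hidden states would come from forward passes over contrasting prompts, and the layer and coefficient would need to be chosen empirically.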

Code Implementations: 1 repo
