AI · Jul 23, 2025

The Geometry of Harmfulness in LLMs through Subconcept Probing

arXiv:2507.21141v1 · 5 citations · h-index: 3
Originality: Highly original
AI Analysis

This work addresses AI safety for developers and users by providing interpretable tools to audit and mitigate harmful outputs in LLMs, though it builds incrementally on existing concept-subspace methods.

The paper aims to understand and reduce harmful behaviour in large language models. It introduces a framework that trains a linear probe for each of 55 harmfulness subconcepts, finds that the resulting directions span a low-rank subspace of activation space, and shows that steering along that subspace's dominant direction nearly eliminates harmfulness with little loss of utility.

Recent advances in large language models (LLMs) have intensified the need to understand and reliably curb their harmful behaviours. We introduce a multidimensional framework for probing and steering harmful content in model internals. For each of 55 distinct harmfulness subconcepts (e.g., racial hate, employment scams, weapons), we learn a linear probe, yielding 55 interpretable directions in activation space. Collectively, these directions span a harmfulness subspace that we show is strikingly low-rank. We then test ablation of the entire subspace from model internals, as well as steering and ablation in the subspace's dominant direction. We find that dominant direction steering allows for near elimination of harmfulness with a low decrease in utility. Our findings advance the emerging view that concept subspaces provide a scalable lens on LLM behaviour and offer practical tools for the community to audit and harden future generations of language models.
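
The pipeline described in the abstract (one direction per subconcept, a low-rank subspace, then ablation or steering) can be illustrated with a small NumPy sketch. This is a toy under stated assumptions, not the authors' implementation: the activations are synthetic, a difference-of-means direction stands in for a trained linear probe, and every function name, dimension, and the steering strength `alpha` are hypothetical choices made for illustration.

```python
import numpy as np


def probe_direction(pos, neg):
    """Unit-norm difference-of-means direction.

    Assumption: this is a simple stand-in for a trained linear probe.
    """
    w = pos.mean(axis=0) - neg.mean(axis=0)
    return w / np.linalg.norm(w)


def dominant_direction(W):
    """Top right singular vector of the (k, d) probe matrix, plus singular values."""
    _, s, vt = np.linalg.svd(W, full_matrices=False)
    v = vt[0]
    # The sign of a singular vector is arbitrary; orient it with the mean probe.
    if v @ W.mean(axis=0) < 0:
        v = -v
    return v, s


def ablate_subspace(h, W, rank):
    """Remove the component of h lying in the top-`rank` subspace spanned by W."""
    _, _, vt = np.linalg.svd(W, full_matrices=False)
    B = vt[:rank]  # (rank, d) orthonormal basis of the subspace
    return h - (h @ B.T) @ B


def steer(h, v, alpha):
    """Shift activations by alpha along the unit direction v."""
    return h + alpha * v


# Synthetic demo: 55 subconcepts that mostly share one hidden direction.
rng = np.random.default_rng(0)
d, k, n = 64, 55, 200  # hypothetical hidden size, subconcept count, samples per class
shared = rng.normal(size=d)
shared /= np.linalg.norm(shared)

rows = []
for _ in range(k):
    offset = 3.0 * shared + 0.1 * rng.normal(size=d)  # subconcept-specific shift
    neg = rng.normal(size=(n, d))                      # "harmless" activations
    pos = rng.normal(size=(n, d)) + offset             # "harmful" activations
    rows.append(probe_direction(pos, neg))
W = np.stack(rows)  # (k, d): one direction per subconcept

v, s = dominant_direction(W)
energy = s**2 / np.sum(s**2)  # share of squared singular values per component

h = rng.normal(size=(5, d)) + 3.0 * shared   # activations to intervene on
h_ablated = ablate_subspace(h, W, rank=1)    # dominant-direction ablation
h_steered = steer(h, v, alpha=-4.0)          # steer away from the dominant direction

print("energy in top component:", round(float(energy[0]), 3))
```

In this toy setup, low rank shows up as most of the squared singular-value mass sitting in the first component. With a real model, `W` would hold probe weights fitted on hidden states from a chosen layer, and the interventions would be applied to activations during the forward pass rather than to a static array.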

