LG | Dec 13, 2025

Sparse Concept Anchoring for Interpretable and Controllable Neural Representations

arXiv:2512.12469v24.1

Originality: Incremental advance

AI Analysis

This work addresses AI practitioners' need for interpretable and steerable learned representations, though it appears incremental because it builds on structured autoencoders.

The paper tackles the problem of making neural representations interpretable and controllable. It introduces Sparse Concept Anchoring, which biases the latent space to position targeted concepts with minimal supervision. In experiments on structured autoencoders, steering selectively attenuates targeted concepts with negligible impact on orthogonal features, and ablation eliminates them completely with reconstruction error approaching theoretical bounds.

We introduce Sparse Concept Anchoring, a method that biases latent space to position a targeted subset of concepts while allowing others to self-organize, using only minimal supervision (labels for <0.1% of examples per anchored concept). Training combines activation normalization, a separation regularizer, and anchor or subspace regularizers that attract rare labeled examples to predefined directions or axis-aligned subspaces. The anchored geometry enables two practical interventions: reversible behavioral steering that projects out a concept's latent component at inference, and permanent removal via targeted weight ablation of anchored dimensions. Experiments on structured autoencoders show selective attenuation of targeted concepts with negligible impact on orthogonal features, and complete elimination with reconstruction error approaching theoretical bounds. Sparse Concept Anchoring therefore provides a practical pathway to interpretable, steerable behavior in learned representations.
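The two interventions described in the abstract, and the idea of attracting labeled examples to a predefined direction, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not give the loss formulas, so the cosine-based attraction term, the function names (`normalize`, `anchor_loss`, `project_out`, `ablate_dims`), and the decoder weight layout `(latent_dim, out_dim)` are all assumptions made for illustration.

```python
import numpy as np

def normalize(z, eps=1e-8):
    # Assumed form of activation normalization: scale each latent to unit length.
    return z / (np.linalg.norm(z, axis=-1, keepdims=True) + eps)

def anchor_loss(z, labeled_mask, anchor):
    # Hypothetical anchor regularizer: mean (1 - cosine similarity) between
    # the few labeled latents and the predefined anchor direction.
    if not labeled_mask.any():
        return 0.0
    a = anchor / np.linalg.norm(anchor)
    cos = normalize(z[labeled_mask]) @ a
    return float(np.mean(1.0 - cos))

def project_out(z, anchor):
    # Reversible steering at inference: subtract each latent's component
    # along the anchor direction; the encoder and decoder are untouched.
    a = anchor / np.linalg.norm(anchor)
    return z - np.outer(z @ a, a)

def ablate_dims(W_dec, dims):
    # Permanent removal for an axis-aligned subspace: zero the decoder rows
    # that read from the anchored latent dimensions (assumed layout:
    # W_dec has shape (latent_dim, out_dim)).
    W = W_dec.copy()
    W[dims, :] = 0.0
    return W

# Toy data: 5 latents of dimension 4, concept anchored to the first axis.
rng = np.random.default_rng(0)
z = rng.normal(size=(5, 4))
anchor = np.array([1.0, 0.0, 0.0, 0.0])

z_steered = project_out(z, anchor)           # concept component removed
W_dec = rng.normal(size=(4, 3))
W_ablated = ablate_dims(W_dec, dims=[0])     # anchored dimension cut off

labeled = np.array([True, False, False, False, False])
z_aligned = np.tile(anchor, (5, 1))          # latents already on the anchor
loss_aligned = anchor_loss(z_aligned, labeled, anchor)
```

With an axis-aligned anchor, projecting out the concept only changes the anchored coordinate, which is one way to read the abstract's claim of negligible impact on orthogonal features. Projection is applied per forward pass and can be switched off, whereas zeroing weights changes the model itself.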
