LG · AI · Mar 14, 2025

Don't Forget It! Conditional Sparse Autoencoder Clamping Works for Unlearning

arXiv:2503.11127v1 · 5 citations · h-index: 13
Originality: Incremental advance
AI Analysis

This paper addresses safety risks in AI by enabling explicit unlearning of dangerous information. It is an incremental advance because it builds on existing SAE methods.

The paper tackles the problem of removing harmful knowledge from large language models. It uses Sparse Autoencoders to identify unwanted concepts and steer them away, reducing the model's ability to answer harmful questions while maintaining its performance on harmless queries.

Recent developments in Large Language Model (LLM) capabilities have brought great potential but also posed new risks. For example, LLMs with knowledge of bioweapons, advanced chemistry, or cyberattacks could cause violence if they fall into the wrong hands or malfunction. Because LLMs are near-black boxes, intuitive interpretation of their internals remains an open research question, which prevents developers from easily controlling model behavior and capabilities. Sparse Autoencoders (SAEs) have recently emerged as a potential method for unraveling representations of concepts in LLM internals, and they have allowed developers to steer model outputs by directly modifying the hidden activations. In this paper, we use SAEs to identify unwanted concepts from the Weapons of Mass Destruction Proxy (WMDP) dataset within gemma-2-2b internals and use feature steering to reduce the model's ability to answer harmful questions while retaining its performance on harmless queries. Our results renew optimism about the viability of SAE-based explicit knowledge unlearning techniques.
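The abstract does not spell out the clamping procedure, so the following is only a minimal sketch of what conditional SAE feature clamping could look like, not the paper's implementation. It assumes a toy ReLU autoencoder with hypothetical weights `W_enc`, `b_enc`, `W_dec`, `b_dec`, and hypothetical parameters `target_features`, `threshold`, and `clamp_value`. The idea shown is that a selected feature is overwritten only when it fires above a threshold, so activations on unrelated inputs pass through unchanged.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = Sequence[Sequence[float]]


def sae_encode(h: Sequence[float], W_enc: Matrix, b_enc: Sequence[float]) -> Vector:
    """Toy SAE encoder: f_j = relu(sum_i h_i * W_enc[i][j] + b_enc[j])."""
    return [
        max(0.0, sum(h[i] * W_enc[i][j] for i in range(len(h))) + b_enc[j])
        for j in range(len(b_enc))
    ]


def sae_decode(f: Sequence[float], W_dec: Matrix, b_dec: Sequence[float]) -> Vector:
    """Toy SAE decoder: h_hat_i = sum_j f_j * W_dec[j][i] + b_dec[i]."""
    return [
        sum(f[j] * W_dec[j][i] for j in range(len(f))) + b_dec[i]
        for i in range(len(b_dec))
    ]


def conditional_clamp(
    h: Sequence[float],
    W_enc: Matrix,
    b_enc: Sequence[float],
    W_dec: Matrix,
    b_dec: Sequence[float],
    target_features: Sequence[int],  # hypothetical: indices of unwanted features
    threshold: float,                # hypothetical: activation level that triggers clamping
    clamp_value: float,              # hypothetical: value written into a triggered feature
) -> Vector:
    """Return a steered hidden activation.

    A target feature is clamped only if it is active above `threshold`.
    The SAE reconstruction error is added back so that information the
    SAE does not capture is preserved.
    """
    f = sae_encode(h, W_enc, b_enc)
    h_hat = sae_decode(f, W_dec, b_dec)
    error = [hi - hhi for hi, hhi in zip(h, h_hat)]

    clamped = list(f)
    for j in target_features:
        if f[j] > threshold:
            clamped[j] = clamp_value

    steered = sae_decode(clamped, W_dec, b_dec)
    return [s + e for s, e in zip(steered, error)]


# Illustrative 2-dimensional example with identity weights (assumed values).
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
ZERO = [0.0, 0.0]

harmful_like = conditional_clamp(
    [3.0, 1.0], IDENTITY, ZERO, IDENTITY, ZERO,
    target_features=[0], threshold=1.0, clamp_value=-2.0,
)
harmless_like = conditional_clamp(
    [0.5, 1.0], IDENTITY, ZERO, IDENTITY, ZERO,
    target_features=[0], threshold=1.0, clamp_value=-2.0,
)
```

In this toy setup, feature 0 exceeds the threshold for the first input and is overwritten, while the second input is returned unchanged because the feature stays below the threshold. In a real setting the steered vector would replace the hidden activation at the layer where the SAE was trained; that wiring is omitted here.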

