CL · Sep 16, 2024

Householder Pseudo-Rotation: A Novel Approach to Activation Editing in LLMs with Direction-Magnitude Perspective

arXiv:2409.10053v2 · 34 citations · h-index: 7
AI Analysis

For researchers and practitioners in AI safety, this work addresses a limitation of existing activation-editing methods: they struggle to deliver larger performance improvements while keeping activation magnitudes consistent.

The paper tackles the problem of activation editing in large language models by proposing a novel method that views activations in terms of directions and magnitudes, resulting in improved performance on safety benchmarks.

Activation Editing, which involves directly editing the internal representations of large language models (LLMs) to alter their behaviors and achieve desired properties, has emerged as a promising area of research. Existing works primarily treat LLMs' activations as points in space and modify them by adding steering vectors. However, this approach is limited in its ability to achieve greater performance improvement while maintaining the necessary consistency of activation magnitudes. To overcome these issues, we propose a novel editing method that views activations in terms of their directions and magnitudes. Our method, named Householder Pseudo-Rotation (HPR), mimics the rotation transformation, thus preserving activation norms and resulting in improved performance on various safety benchmarks.
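The contrast the abstract draws between adding a steering vector and a norm-preserving transformation can be illustrated with a toy example. The sketch below is not the paper's HPR procedure; it only shows the standard linear-algebra fact that a Householder reflection leaves a vector's norm unchanged, whereas additive steering generally does not. The function names, the toy activation, and the `direction` vector are all hypothetical choices made for illustration.

```python
import math

def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def add_steering(a, s, alpha=1.0):
    """Additive steering: a + alpha * s. This generally changes the norm of a."""
    return [ai + alpha * si for ai, si in zip(a, s)]

def householder_reflect(a, u):
    """Reflect a across the hyperplane orthogonal to u: (I - 2 u u^T / u^T u) a."""
    uu = sum(x * x for x in u)
    if uu == 0.0:
        raise ValueError("u must be non-zero")
    coef = 2.0 * sum(ai * ui for ai, ui in zip(a, u)) / uu
    return [ai - coef * ui for ai, ui in zip(a, u)]

# Toy values (assumptions for illustration, not taken from the paper).
activation = [3.0, 4.0, 0.0]   # norm 5
direction = [0.0, 1.0, 1.0]    # used as steering vector and as hyperplane normal

steered = add_steering(activation, direction, alpha=2.0)
reflected = householder_reflect(activation, direction)

print(norm(activation), norm(steered), norm(reflected))
```

Here the steered vector has a larger norm than the original activation, while the reflected vector keeps the original norm and only changes direction, which is the property the direction-magnitude view relies on.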

Code Implementations: 1 repo
