CL | AI | Apr 16, 2024

DeStein: Navigating Detoxification of Language Models via Universal Steering Pairs and Head-wise Activation Fusion

arXiv:2404.10464v3 | 10 citations | h-index: 6 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses toxicity in language model outputs, a concern for both users and developers. It offers a more practical solution with lower resource costs, though the advance is incremental because it builds on existing representation engineering techniques.

The paper tackles the problem of language models generating toxic outputs by proposing DeStein, a method that applies representation engineering with universal steering pairs and head-wise activation fusion. The authors report that it significantly outperforms previous state-of-the-art approaches on various metrics while maintaining generation quality and diversity.

Despite the remarkable achievements of language models (LMs) across a broad spectrum of tasks, their propensity for generating toxic outputs remains a prevalent concern. Current solutions involving finetuning or auxiliary models usually require extensive computational resources, hindering their practicality in large language models (LLMs). In this paper, we propose DeStein, a novel method that detoxifies LMs by applying representation engineering in activation spaces with lower resource and time costs. Specifically, we derive detoxification vectors from self-induced, universal steering pairs through arithmetic operations in activation spaces. During inference, detoxification is achieved by fusing the detoxification vectors with the original representations in a head-wise manner. Empirical results demonstrate that our method significantly outperforms previous state-of-the-art approaches on various metrics, while also maintaining satisfactory generation quality and diversity. We further validate the practicality and scalability of DeStein with a series of white-box LLMs. The method is open-sourced at https://github.com/LizLizLi/DeStein. Warning: Some example model outputs may contain highly offensive or disturbing text.
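The two steps the abstract describes (deriving detoxification vectors by arithmetic over steering pairs, then fusing them head-wise at inference) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the mean-difference rule, the per-head weights, and the `alpha` strength parameter are all assumptions made for illustration, and activations are plain Python lists rather than model tensors. See the linked repository for the actual method.

```python
from statistics import fmean


def detox_vectors(toxic_acts, nontoxic_acts):
    """Per-head mean difference (non-toxic minus toxic) over steering pairs.

    Both arguments are indexed as [pair][head][dim]. Averaging the paired
    differences is an assumed form of the "arithmetic operations in
    activation spaces" mentioned in the abstract.
    """
    n_heads = len(toxic_acts[0])
    dim = len(toxic_acts[0][0])
    return [
        [
            fmean(n[h][d] - t[h][d] for t, n in zip(toxic_acts, nontoxic_acts))
            for d in range(dim)
        ]
        for h in range(n_heads)
    ]


def fuse_headwise(head_acts, vectors, head_weights, alpha=1.0):
    """Shift each head's activation along its own detoxification vector.

    head_weights (one hypothetical coefficient per head) and alpha (a
    hypothetical global strength) are illustrative assumptions.
    """
    return [
        [x + alpha * w * v for x, v in zip(act, vec)]
        for act, vec, w in zip(head_acts, vectors, head_weights)
    ]


# Toy data: 2 steering pairs, 2 heads, 2 dimensions per head.
toxic = [[[1.0, 0.0], [0.0, 0.0]], [[3.0, 0.0], [0.0, 2.0]]]
nontoxic = [[[0.0, 1.0], [0.0, 0.0]], [[1.0, 1.0], [0.0, 0.0]]]

vectors = detox_vectors(toxic, nontoxic)
steered = fuse_headwise([[1.0, 1.0], [2.0, 2.0]], vectors, [1.0, 0.5], alpha=2.0)
```

Keeping one vector and one weight per head means the steering strength can differ across heads, which is the point of doing the fusion head-wise rather than applying a single shift to the whole layer output.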

Code Implementations: 1 repo
