CL · LG · Mar 17, 2025

DAPI: Domain Adaptive Toxicity Probe Vector Intervention for Fine-Grained Detoxification

arXiv:2503.12882v1 · 2 citations · h-index: 1 · ACL
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of removing specific types of toxicity from AI-generated text. It is an incremental improvement over single-vector methods.

The paper tackles fine-grained detoxification in text generation by proposing a category-specific toxicity probe vector approach, which achieves up to a 78.52% reduction in toxicity while keeping fluency nearly unchanged (a 0.052% drop compared to the unsteered model).

There have been attempts to use linear probes for detoxification, with existing studies relying on a single toxicity probe vector to reduce toxicity. However, toxicity can be divided into fine-grained subcategories, which makes it difficult to remove certain types of toxicity with a single toxicity probe vector. To address this limitation, we propose a category-specific toxicity probe vector approach. First, we train multiple toxicity probe vectors, one for each toxicity category. During generation, we dynamically select the most relevant toxicity probe vector based on the current context. Finally, the selected vector is dynamically scaled and subtracted from the model. Our method successfully mitigates toxicity from categories that the single probe vector approach fails to detoxify. Experiments demonstrate that our approach achieves up to a 78.52% reduction in toxicity on the evaluation dataset, while fluency remains nearly unchanged, with only a 0.052% drop compared to the unsteered model.
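
The select, scale, and subtract steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The abstract does not specify the selection rule, the scaling rule, or where in the model the vector is subtracted, so everything concrete here is an assumption: the function name `steer_hidden_state`, the parameter `alpha`, selection by highest probe probability, and scaling by that probability.

```python
import numpy as np


def steer_hidden_state(hidden, probes, biases, alpha=1.0):
    """Subtract a scaled category-specific probe direction from a hidden state.

    hidden: (d,) hidden state at the current decoding step
    probes: (k, d) one linear probe weight vector per toxicity category
    biases: (k,) probe biases
    alpha:  hypothetical global steering strength (not from the paper)
    """
    # Unit-normalise each probe so the step size is comparable across categories.
    units = probes / np.linalg.norm(probes, axis=1, keepdims=True)

    # Linear probe outputs: sigmoid of the per-category logits.
    logits = probes @ hidden + biases
    probs = 1.0 / (1.0 + np.exp(-logits))

    # Assumed selection rule: the category whose probe fires most strongly.
    idx = int(np.argmax(probs))

    # Assumed scaling rule: steer harder when the probe is more confident.
    scale = alpha * float(probs[idx])

    steered = hidden - scale * units[idx]
    return steered, idx, scale


# Toy usage with two hypothetical categories in a 3-dimensional space.
hidden = np.array([2.0, 0.0, 0.0])
probes = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
biases = np.zeros(2)
steered, idx, scale = steer_hidden_state(hidden, probes, biases)
```

In the toy usage, the first probe aligns with the hidden state, so it is selected and the hidden state is moved back along that direction only. In a real model this step would run on the hidden states of a transformer layer at each decoding step.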

