Probe-Free Low-Rank Activation Intervention
This work addresses unreliable language model outputs, which matter to users who need accurate and safe text generation. It is an incremental improvement over existing activation intervention methods in that it eliminates the need for trained probes.
The paper tackles untruthful or toxic language model output with FLORAIN, a probe-free activation intervention method that modifies hidden activations to steer generation. It consistently outperforms several baseline methods in truthfulness and quality across generation and multiple-choice tasks.
Language models (LMs) can produce text that appears accurate and coherent but contains untruthful or toxic content. Inference-time interventions that edit hidden activations have shown promising results in steering LMs towards desirable generations. Existing activation intervention methods often include an activation probe that detects undesirable generation and triggers an activation modification to steer subsequent generation. This paper proposes FLORAIN, a probe-free intervention method that applies to all attention heads in a specific activation layer. It eliminates the need to train classifiers for probing purposes. The intervention function is parametrized by a sample-wise nonlinear low-rank mapping, which is trained by minimizing the distance between the modified activations and their projection onto the manifold of desirable content. Under specific constructions of the manifold and projection distance, we show that the intervention strategy can be computed efficiently by solving a smooth optimization problem. The empirical results, benchmarked on multiple base models, demonstrate that FLORAIN consistently outperforms several baseline methods in enhancing model truthfulness and quality across generation and multiple-choice tasks.
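The training objective described above can be illustrated with a small sketch. It is not the paper's construction: the abstract does not specify the manifold, the distance, or the form of the mapping, so everything here is an assumption chosen for illustration. The "manifold of desirable content" is taken to be a Euclidean ball around the mean of desirable activations, the distance is Euclidean, the low-rank mapping is `a + tanh(a V + b) U^T`, and the activations are synthetic stand-ins for the concatenated head outputs of one layer. All function and variable names are hypothetical.

```python
import numpy as np

def project_to_ball(x, center, radius):
    """Project each row of x onto the Euclidean ball B(center, radius)."""
    diff = x - center
    norm = np.linalg.norm(diff, axis=1, keepdims=True)
    scale = np.minimum(1.0, radius / np.maximum(norm, 1e-12))
    return center + diff * scale

def intervene(a, U, V, b):
    """Assumed low-rank nonlinear edit: a + tanh(a V + b) U^T, with rank = U.shape[1]."""
    return a + np.tanh(a @ V + b) @ U.T

def loss_and_grads(a, U, V, b, center, radius):
    """Mean of 0.5 * squared distance from edited activations to the ball, with gradients."""
    n = a.shape[0]
    h = np.tanh(a @ V + b)                      # (n, r)
    out = a + h @ U.T                           # (n, d)
    resid = out - project_to_ball(out, center, radius)
    loss = 0.5 * np.sum(resid ** 2) / n
    # For a closed convex set, the gradient of 0.5 * dist^2 at x is x - proj(x).
    g_out = resid / n
    g_U = g_out.T @ h                           # (d, r)
    g_z = (g_out @ U) * (1.0 - h ** 2)          # back through tanh, (n, r)
    g_V = a.T @ g_z                             # (d, r)
    g_b = g_z.sum(axis=0)                       # (r,)
    return loss, g_U, g_V, g_b

def train(a, center, radius, rank=2, steps=200, lr=0.1, seed=0):
    """Gradient descent with step halving so that the loss never increases."""
    rng = np.random.default_rng(seed)
    d = a.shape[1]
    U = 0.1 * rng.standard_normal((d, rank))
    V = 0.1 * rng.standard_normal((d, rank))
    b = np.zeros(rank)
    loss, g_U, g_V, g_b = loss_and_grads(a, U, V, b, center, radius)
    history = [loss]
    for _ in range(steps):
        step = lr
        for _ in range(20):                     # halve the step until the loss drops
            cand = (U - step * g_U, V - step * g_V, b - step * g_b)
            new = loss_and_grads(a, *cand, center, radius)
            if new[0] < loss:
                U, V, b = cand
                loss, g_U, g_V, g_b = new
                break
            step *= 0.5
        history.append(loss)
    return U, V, b, history

# Synthetic stand-ins for layer activations (assumption: 16 dimensions, 200 samples each).
rng = np.random.default_rng(0)
desirable = 2.0 + 0.5 * rng.standard_normal((200, 16))
undesirable = rng.standard_normal((200, 16))

# Assumed "manifold": a ball covering 90% of the desirable activations.
center = desirable.mean(axis=0)
radius = np.quantile(np.linalg.norm(desirable - center, axis=1), 0.9)

U, V, b, history = train(undesirable, center, radius)
print("loss before training:", history[0])
print("loss after training: ", history[-1])
```

No probe or classifier appears anywhere in the sketch: the edit is applied to every activation, and activations already inside the ball contribute zero loss. Because the squared distance to a convex set is differentiable, the objective is smooth in `U`, `V`, and `b`, which is the property that makes plain gradient steps usable in this toy setting.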