LG | AI | CL | Oct 30, 2024

Focus On This, Not That! Steering LLMs with Adaptive Feature Specification

arXiv:2410.22944v4 | 6 citations | h-index: 116 | ICML
Originality: Highly original
AI Analysis

This work addresses model misalignment and bias in LLMs, a concern for users seeking more controllable and fair AI systems. It is assessed as a novel method rather than an incremental improvement.

The paper tackles the problem that large language models (LLMs) leverage spurious or biased features from their training data, which leads to misalignment and undesired behaviors. It introduces Focus Instruction Tuning (FIT), which trains LLMs to condition their responses on specified features while ignoring others. The reported results are improved robustness, bias mitigation, and generalization across diverse benchmarks.

Despite the success of Instruction Tuning (IT) in training large language models (LLMs), such models often leverage spurious or biased features learnt from their training data and can become misaligned, leading to undesired behaviours. While existing techniques can steer model behaviour at inference-time, they are often post-hoc and do not embed steering as an intrinsic model feature. In this work, we introduce Focus Instruction Tuning (FIT), which trains LLMs to condition their responses by focusing on specific features whilst ignoring others, leading to different behaviours based on what features are specified. Across diverse benchmarks, we demonstrate that FIT: (i) successfully steers behaviour at inference time; (ii) increases robustness by amplifying core task signals and down-weighting spurious cues; (iii) mitigates social bias by suppressing demographic attributes; and (iv) generalises under distribution shifts and to previously unseen focus features. FIT therefore offers a lightweight, intrinsic mechanism for building more robust, fair, and easily controllable LLMs.
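The idea of conditioning a response on a specified feature can be illustrated with a small prompt-construction sketch. This is only an illustration under stated assumptions: the names `FocusExample`, `build_focus_instruction`, and `build_prompt`, as well as the wording of the "Focus on ... Ignore ..." template, are hypothetical and are not taken from the paper. How training targets are chosen for each focus specification is not described in this summary, so the sketch covers prompt construction only.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class FocusExample:
    """One example: a task instruction, an input text, and a target label."""
    task_instruction: str
    text: str
    label: str


def build_focus_instruction(focus: Optional[str], ignore: Sequence[str] = ()) -> str:
    """Render a focus instruction (hypothetical wording, not the paper's template)."""
    parts = []
    if focus:
        parts.append(f"Focus on {focus}.")
    if ignore:
        parts.append("Ignore " + ", ".join(ignore) + ".")
    return " ".join(parts)


def build_prompt(example: FocusExample,
                 focus: Optional[str] = None,
                 ignore: Sequence[str] = ()) -> str:
    """Combine task instruction, optional focus instruction, and input into a prompt."""
    sections = [example.task_instruction]
    focus_instruction = build_focus_instruction(focus, ignore)
    if focus_instruction:
        # With no focus or ignore features, the prompt is a plain instruction prompt.
        sections.append(focus_instruction)
    sections.append(f"Input: {example.text}")
    sections.append("Answer:")
    return "\n".join(sections)


example = FocusExample(
    task_instruction="Classify the sentiment of the review as positive or negative.",
    text="The plot was dull, but the popcorn was great.",
    label="negative",
)
prompt = build_prompt(
    example,
    focus="the reviewer's opinion of the film",
    ignore=["mentions of food"],
)
print(prompt)
```

The same input can be paired with different focus specifications at inference time, which is the kind of steering the abstract describes. Omitting both `focus` and `ignore` falls back to an ordinary instruction prompt.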

