LG | AI | Mar 4

Steering Frozen LLMs: Adaptive Social Alignment via Online Prompt Routing

arXiv:2603.15647 | h-index: 9
Originality: Incremental advance
AI Analysis

The paper addresses the safety and adaptability challenge facing LLM users and developers by enabling dynamic alignment without retraining. The advance is incremental, since it builds on existing prompt routing and bandit methods.

The paper tackles the problem of static safety alignment in large language models (LLMs) by proposing an inference-time governance framework that adapts to evolving jailbreak behaviors and pluralistic safety norms. Against strong baselines, the framework achieves a 10.98% improvement in cumulative reward and a 14.42% reduction in the average suboptimality gap.

Large language models (LLMs) are typically governed by post-training alignment (e.g., RLHF or DPO), which yields a largely static policy during deployment and inference. However, real-world safety is a full-lifecycle problem: static defenses degrade against evolving jailbreak behaviors, and fixed weights cannot adapt to pluralistic, time-varying safety norms. This motivates inference-time governance that steers behavior without costly retraining. To address this, we introduce the Consensus Clustering LinUCB Bandit (CCLUB), a unified framework for adaptive social alignment via system-prompt routing. CCLUB employs a conservative consensus clustering mechanism: it pools data only within the intersection of utility and safety similarity graphs, effectively preventing unsafe generalization across semantically proximal but risk-divergent contexts. Our theoretical analysis yields a sublinear regret guarantee, demonstrating near-optimal performance of CCLUB. Extensive experiments validate that CCLUB outperforms strong baselines, achieving a 10.98% improvement in cumulative reward and a 14.42% reduction in the average suboptimality gap.
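The abstract describes two ingredients: pooling data only where a utility similarity graph and a safety similarity graph agree, and a LinUCB bandit that routes among system prompts. The sketch below illustrates that idea under loud assumptions. It is not the paper's implementation: the function `consensus_clusters`, the class `PooledLinUCB`, the boolean adjacency matrices, the use of connected components as clusters, and the parameters `alpha` and `ridge` are all hypothetical choices made for illustration, and the standard disjoint LinUCB update is swapped in because the abstract does not specify CCLUB's exact update rule.

```python
import numpy as np


def consensus_clusters(utility_adj, safety_adj):
    """Label contexts by connected component of the intersected graph.

    Assumption: an edge survives only if it is present in BOTH graphs,
    so contexts that look similar in utility but differ in risk are
    never pooled together.
    """
    both = np.logical_and(utility_adj, safety_adj)
    n = both.shape[0]
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # depth-first search over surviving edges
            node = stack.pop()
            for nbr in np.flatnonzero(both[node]):
                if labels[nbr] == -1:
                    labels[nbr] = current
                    stack.append(int(nbr))
        current += 1
    return labels


class PooledLinUCB:
    """Standard disjoint LinUCB; each arm stands for one system prompt.

    One instance is meant to be shared by all contexts in a cluster.
    """

    def __init__(self, n_arms, dim, alpha=1.0, ridge=1.0):
        self.alpha = alpha
        self.A = np.stack([ridge * np.eye(dim) for _ in range(n_arms)])
        self.b = np.zeros((n_arms, dim))

    def select(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b  # ridge-regression estimate for this arm
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)  # exploration term
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x


# Toy example: three contexts, all similar in utility,
# but context 2 differs from the others in safety.
utility_adj = np.ones((3, 3), dtype=bool)
safety_adj = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1]], dtype=bool)
labels = consensus_clusters(utility_adj, safety_adj)

# One bandit per consensus cluster; contexts 0 and 1 share statistics.
bandits = {c: PooledLinUCB(n_arms=2, dim=2) for c in set(labels)}
```

In this toy setup, contexts 0 and 1 land in one cluster and context 2 in another, so feedback from context 2 never influences the prompt chosen for contexts 0 and 1, even though a utility-only graph would have merged all three.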

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
