On Almost Surely Safe Alignment of Large Language Models at Inference-Time
This work addresses the safety problem in LLM deployment by offering a resource-efficient alternative to methods such as RLHF, though the contribution is incremental because it builds on existing inference-time alignment techniques.
The paper tackles the problem of ensuring large language models generate safe responses by introducing InferenceGuard, an inference-time alignment approach that models safe generation as a constrained Markov Decision Process in the latent space, achieving formal safety guarantees and outperforming existing methods in balancing safety and task performance.
We introduce a novel inference-time alignment approach for LLMs that aims to generate safe responses almost surely, i.e., with probability approaching one. Our approach models the generation of safe responses as a constrained Markov Decision Process (MDP) within the LLM's latent space. We augment the state with a safety state that tracks the evolution of safety constraints, and we dynamically penalize unsafe generations to ensure the generation of safe responses. Consequently, we demonstrate formal safety guarantees with respect to the given cost model upon solving the MDP in the latent space with sufficiently large penalties. Building on this foundation, we propose InferenceGuard, a practical implementation that safely aligns LLMs without modifying the model weights. Empirically, we demonstrate that InferenceGuard effectively balances safety and task performance, outperforming existing inference-time alignment methods in generating safe and aligned responses. Our findings contribute to the advancement of safer LLM deployment through alignment at inference time, thus presenting a promising alternative to resource-intensive, overfitting-prone alignment techniques like RLHF.
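
The safety-state idea in the abstract can be illustrated with a small sketch. This is not the InferenceGuard implementation: it works on plain tokens rather than the LLM's latent space, and every name in it (AugmentedState, step, penalized_return, select_response, toy_cost, toy_reward) is hypothetical. It assumes a per-token cost model and a fixed safety budget, tracks the remaining budget as an extra state variable, and replaces the task reward with a large negative penalty once the budget is exhausted.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class AugmentedState:
    """Generation state plus a safety state (hypothetical structure)."""
    tokens: Tuple[str, ...]  # tokens generated so far
    budget: float            # remaining safety budget


def step(state: AugmentedState, token: str,
         cost_fn: Callable[[str], float]) -> AugmentedState:
    """Append a token and subtract its cost from the remaining budget."""
    return AugmentedState(state.tokens + (token,), state.budget - cost_fn(token))


def penalized_return(state: AugmentedState,
                     reward_fn: Callable[[Sequence[str]], float],
                     penalty: float) -> float:
    """Task reward while the budget is non-negative, otherwise -penalty."""
    if state.budget < 0:
        return -penalty
    return reward_fn(state.tokens)


def select_response(candidates: List[List[str]],
                    cost_fn: Callable[[str], float],
                    reward_fn: Callable[[Sequence[str]], float],
                    budget: float,
                    penalty: float = 1e6) -> List[str]:
    """Roll each candidate through the augmented state and keep the best one."""
    scored = []
    for cand in candidates:
        state = AugmentedState((), budget)
        for tok in cand:
            state = step(state, tok, cost_fn)
        scored.append((penalized_return(state, reward_fn, penalty), cand))
    return max(scored, key=lambda pair: pair[0])[1]


# Toy stand-ins (assumptions): a real system would use learned cost/reward models.
def toy_cost(token: str) -> float:
    return 1.0 if token == "bad" else 0.0


def toy_reward(tokens: Sequence[str]) -> float:
    return float(len(tokens))


if __name__ == "__main__":
    cands = [["ok", "bad", "bad", "more"], ["ok", "fine"]]
    print(select_response(cands, toy_cost, toy_reward, budget=0.5))
```

With a sufficiently large penalty, any candidate that exceeds the budget scores below every candidate that stays within it, which is the intuition behind the "sufficiently large penalties" condition in the abstract. The formal guarantee in the paper concerns the solved MDP in latent space, not this toy selection rule.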