Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment
For practitioners of LLM alignment, this work offers a lightweight method to mitigate reward hacking in inference-time alignment, though it is an incremental extension of existing techniques.
The paper extends inference-time alignment in two ways: it introduces reference-model temperature adjustment, and it combines generative reward models as a sharpened logarithmic opinion pool (SLOP). It also proposes a calibration algorithm for the SLOP weights that, in the authors' experiments, improves robustness against reward hacking while maintaining alignment performance.
Inference-time alignment techniques offer a lightweight alternative or complement to costly reinforcement learning, while enabling continual adaptation as alignment objectives and reward targets evolve. Existing theoretical analyses justify these methods as approximations to sampling from distributions optimally tilted toward a given reward model. We extend these techniques by introducing reference-model temperature adjustment, which in turn generalizes inference-time alignment to ensembles of generative reward models combined as a sharpened logarithmic opinion pool (SLOP). To mitigate reward hacking, we propose an algorithm for calibrating the SLOP weight parameters and experimentally demonstrate that it improves robustness while preserving alignment performance.
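To make the idea of tempering and tilting concrete, the following is a minimal sketch over a small finite set of candidate responses. It is not the paper's algorithm: the abstract gives no formulas, so the sketch assumes that the reference model is tempered as p_ref^(1/T), that each generative reward model contributes a log-probability per candidate, and that "sharpened" means the pool weights are not constrained to sum to one. The function names `log_normalize` and `slop`, and the parameters `weights` and `ref_temperature`, are hypothetical.

```python
import math


def log_normalize(scores):
    """Subtract the log-partition function so that exp(scores) sums to one."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def slop(ref_logprobs, reward_logprobs, weights, ref_temperature=1.0):
    """Illustrative tempered-and-tilted pool over a finite candidate set.

    Assumed form (not taken from the paper):
        log p(y) = (1/T) * log p_ref(y) + sum_i w_i * log r_i(y) - log Z

    ref_logprobs:    log p_ref(y) for each candidate y
    reward_logprobs: one list per reward model, log r_i(y) per candidate
    weights:         one weight w_i per reward model (need not sum to one)
    """
    scores = []
    for j, ref_lp in enumerate(ref_logprobs):
        s = ref_lp / ref_temperature  # temper the reference model
        for w, r in zip(weights, reward_logprobs):
            s += w * r[j]  # tilt toward each reward model
        scores.append(s)
    return [math.exp(s) for s in log_normalize(scores)]


# Two candidates, a uniform reference, and one reward model with weight 2.
ref = [math.log(0.5), math.log(0.5)]
reward = [[math.log(0.8), math.log(0.2)]]
pooled = slop(ref, reward, weights=[2.0])
```

Under these assumptions, zero weights with `ref_temperature=1.0` recover the reference distribution, while larger weights or a lower reference temperature concentrate mass on the favored candidates. That same sharpening is what makes weight calibration matter when a reward model can be exploited.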