Marvin Gülhan

h-index5

2papers

2 Papers

36.7CLMay 29

Reinforcement Learning Amplifies Emergent Misalignment from Harmless Rewards

Magnus Jørgenvåg, David Kaczér, Lasse Ruttert et al.

Emergent misalignment (EM) is the surprising tendency of language models to become broadly misaligned after fine-tuning on narrowly misaligned examples. While EM has been extensively studied in the supervised fine-tuning (SFT) setting, evidence that it also arises from reinforcement learning (RL) is limited to large, closed-source models, leaving the phenomenon expensive to study and difficult to reproduce. We characterize EM from RL in small, off-the-shelf open-weight models along three axes. First, we show that rewarding narrow, overtly misaligned behavior produces substantially higher general-domain misalignment than sample-matched SFT. Second, we show that EM from RL can be induced by reward signals that could plausibly arise naturally, such as unpopular aesthetic preferences or poor rhetorical appeals. Third, we evaluate in-training mitigations developed for SFT-induced EM and find that they broadly transfer, with interleaving on-policy safety data performing best.

AIApr 26, 2024

Airlift Challenge: A Competition for Optimizing Cargo Delivery

Adis Delanovic, Carmen Chiu, John F. Kolen et al.

Airlift operations require the timely distribution of various cargo, much of which is time sensitive and valuable. These operations, however, have to contend with sudden disruptions from weather and malfunctions, requiring immediate rescheduling. The Airlift Challenge competition seeks possible solutions via a simulator that provides a simplified abstraction of the airlift problem. The simulator uses an OpenAI gym interface that allows participants to create an algorithm for planning agent actions. The algorithm is scored using a remote evaluator against scenarios of ever-increasing difficulty. The second iteration of the competition was underway from November 2023 to April 2024. This paper describes the competition, simulation environment, and results. As a step towards applying generalized planning techniques to the problem, a temporal PDDL domain is presented for the Pickup and Delivery Problem, a model which lies at the core of the Airlift Challenge.