RL agents Implicitly Learning Human Preferences
This work addresses the challenge of making RL agents beneficial for humans by leveraging learned preferences, though it appears incremental as it builds on existing methods for preference modeling.
The paper tackles the problem of aligning RL agents with human preferences by demonstrating that agents implicitly learn these preferences, achieving a 0.93 AUC in predicting human preferences from neural activations compared to 0.8 from raw environment states.
In the real world, RL agents should be rewarded for fulfilling human preferences. We show that RL agents implicitly learn the preferences of humans in their environment. A classifier trained to predict whether a simulated human's preferences are fulfilled, using the activations of an RL agent's neural network as input, achieves 0.93 AUC. The same classifier trained on the raw environment state achieves only 0.8 AUC. Training the classifier on the RL agent's activations also performs much better than training it on activations from an autoencoder. The human preference classifier can be used as the reward function of an RL agent to make the agent more beneficial for humans.
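The comparison described above can be illustrated with a minimal sketch, which is not the authors' implementation. It fits a probe on two feature sets, scores each by AUC, and then reuses the probe's output as a reward signal. Everything here is an assumption made for illustration: the synthetic data, the choice of a logistic (linear) probe, and the names `auc_score`, `fit_logistic_probe`, `predict_proba`, `preference_reward`, and `demo`. The "activations" are simulated by appending a hand-made feature to the raw state, standing in for whatever a trained agent's network would actually encode. The numbers it produces are unrelated to the 0.93 and 0.8 AUC reported in the abstract.

```python
import numpy as np


def auc_score(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))


def fit_logistic_probe(features, labels, lr=0.5, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    X = np.asarray(features, dtype=float)
    y = np.asarray(labels, dtype=float)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y  # gradient of the log loss w.r.t. the logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def predict_proba(features, w, b):
    """Probability that the (simulated) human's preference is fulfilled."""
    return 1.0 / (1.0 + np.exp(-(np.asarray(features, dtype=float) @ w + b)))


def preference_reward(activations, w, b):
    """Hypothetical reward: the probe's probability for a single time step."""
    return float(predict_proba(activations, w, b))


def demo(seed=0, n=2000):
    """Compare a probe on raw state vs. on simulated 'activations'."""
    rng = np.random.default_rng(seed)
    raw = rng.normal(size=(n, 2))  # stand-in for raw state
    # Synthetic preference label: not linearly separable in the raw state.
    labels = (raw[:, 0] * raw[:, 1] > 0).astype(float)
    # Stand-in for agent activations: raw state plus one useful feature.
    acts = np.column_stack([raw, raw[:, 0] * raw[:, 1]])
    half = n // 2  # first half for training, second half for evaluation
    results = {}
    for name, feats in (("raw", raw), ("activations", acts)):
        w, b = fit_logistic_probe(feats[:half], labels[:half])
        scores = predict_proba(feats[half:], w, b)
        results[name] = auc_score(labels[half:], scores)
    return results


if __name__ == "__main__":
    print(demo())
```

In this toy setup the probe on the simulated activations separates the classes far better than the probe on the raw state, because the information it needs is already exposed as a feature. That is the same kind of gap the abstract reports, though here it is built in by construction rather than learned by an agent.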