LG · AI · ML · Apr 5, 2019

Multi-Preference Actor Critic

arXiv:1904.03295v1 · 4 citations
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of integrating diverse human insights into policy learning for reinforcement learning tasks; it represents an incremental improvement.

The paper tackles the problem of incorporating multiple human feedback channels into reinforcement learning. It introduces Multi-Preference Actor Critic (M-PAC), which implements each feedback channel as a constraint on the policy and satisfies those constraints via Lagrangian relaxation. Experiments in Atari and Pendulum show that the constraints are respected and can accelerate learning.

Policy gradient algorithms typically combine discounted future rewards with an estimated value function to compute the direction and magnitude of parameter updates. However, for most Reinforcement Learning tasks, humans can provide additional insight to constrain the policy learning. We introduce a general method to incorporate multiple different feedback channels into a single policy gradient loss. In our formulation, the Multi-Preference Actor Critic (M-PAC), these different types of feedback are implemented as constraints on the policy. We use a Lagrangian relaxation to satisfy these constraints using gradient descent while learning a policy that maximizes rewards. Experiments in Atari and Pendulum verify that the constraints are respected and that they can accelerate the learning process.
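
To make the Lagrangian relaxation idea concrete, here is a minimal sketch on a toy problem. It is not the M-PAC loss from the paper. Everything in it is an assumption chosen for illustration: a one-parameter policy whose mean action mu earns expected reward J(mu) = -(mu - 2)^2, a single hypothetical human-supplied constraint mu <= 1, analytic gradients in place of sampled policy gradients, and the names expected_reward_grad, constraint, and train. The policy parameter is updated by gradient descent on the Lagrangian, while the multiplier is updated by projected gradient ascent.

```python
def expected_reward_grad(mu):
    """Gradient of the toy expected reward J(mu) = -(mu - 2)^2 (an assumed stand-in)."""
    return -2.0 * (mu - 2.0)


def constraint(mu, limit=1.0):
    """Hypothetical constraint g(mu) = mu - limit; it is satisfied when g(mu) <= 0."""
    return mu - limit


def train(steps=2000, lr_policy=0.05, lr_multiplier=0.05):
    mu = 0.0   # policy parameter (mean action)
    lam = 0.0  # Lagrange multiplier, kept non-negative
    for _ in range(steps):
        # Lagrangian loss: L(mu, lam) = -J(mu) + lam * g(mu)
        grad_mu = -expected_reward_grad(mu) + lam  # dL/dmu, since dg/dmu = 1
        grad_lam = constraint(mu)                  # dL/dlam
        # Descend on the policy parameter.
        mu -= lr_policy * grad_mu
        # Ascend on the multiplier and project back onto lam >= 0.
        lam = max(0.0, lam + lr_multiplier * grad_lam)
    return mu, lam


if __name__ == "__main__":
    mu, lam = train()
    print(f"mu={mu:.3f}, lambda={lam:.3f}")
```

Without the constraint, the reward-maximizing action would be mu = 2. With it, the iterates should settle near mu = 1 with a multiplier near 2, the point where the reward gradient and the constraint penalty balance. With several feedback channels, one would presumably keep one multiplier per constraint and sum the penalty terms into the same loss.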
