LG | AI | Dec 20, 2019

Soft Q Network

arXiv:1912.10891v2 | 2 citations
Originality: Incremental advance
AI Analysis

This work addresses stability and efficiency issues in deep reinforcement learning for complex environments like football simulation, but appears incremental as it builds on existing DQN methods.

The authors tackle the exploit-explore balance problem in reinforcement learning by introducing entropy regularization into Deep Q Networks. They propose Soft Q Network (SQN) and an on-policy variant called Q On-Policy (QOP), which the paper reports as stable and efficient when training agents on the Google Research Football environment.

Deep Q Network (DQN) is a very successful algorithm, yet the inherent problem of reinforcement learning, i.e. the exploit-explore balance, remains. In this work, we introduce entropy regularization into DQN and propose SQN. We find that the backup equation of soft Q learning can enjoy corrective feedback if we view the soft backup as policy improvement in the form of Q, instead of policy evaluation. We show that Soft Q Learning with Corrective Feedback (SQL-CF) underlies the on-policy nature of SQL and the equivalence of SQL and Soft Policy Gradient (SPG). With these insights, we propose an on-policy version of the deep Q learning algorithm, i.e. Q On-Policy (QOP). We experiment with QOP on a self-play environment called Google Research Football (GRF). The QOP algorithm exhibits great stability and efficiency in training GRF agents.
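As a rough illustration of what entropy regularization changes in a Q-learning backup, the sketch below uses the textbook soft Q-learning form: the hard max over next-state Q-values is replaced by a temperature-scaled log-sum-exp, and the policy becomes a Boltzmann distribution over Q-values. This is a minimal sketch under stated assumptions, not the paper's SQN or QOP implementation. The function names and the parameters `alpha` (temperature) and `gamma` (discount) are chosen here for illustration, and the abstract does not specify the exact backup the authors use.

```python
import math


def soft_value(q_values, alpha):
    """Soft state value V(s) = alpha * log(sum_a exp(Q(s, a) / alpha)).

    Subtracting the max first keeps the exponentials numerically stable.
    As alpha approaches 0 this approaches max_a Q(s, a), the usual DQN value.
    """
    m = max(q_values)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_values))


def soft_policy(q_values, alpha):
    """Boltzmann policy pi(a|s) = exp((Q(s, a) - V(s)) / alpha)."""
    v = soft_value(q_values, alpha)
    return [math.exp((q - v) / alpha) for q in q_values]


def soft_backup_target(reward, next_q_values, alpha, gamma, done):
    """One-step soft Bellman target: r + gamma * V_soft(s') (assumed form)."""
    if done:
        return reward
    return reward + gamma * soft_value(next_q_values, alpha)


if __name__ == "__main__":
    q_next = [1.0, 2.0, 3.0]  # hypothetical next-state Q-values
    print(soft_policy(q_next, alpha=0.5))
    print(soft_backup_target(1.0, q_next, alpha=0.5, gamma=0.99, done=False))
```

A larger `alpha` spreads probability across actions and encourages exploration, while a small `alpha` recovers near-greedy behavior, which is one way to read the exploit-explore trade-off the abstract refers to.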

