LGJun 16, 2021

How memory architecture affects learning in a simple POMDP: the two-hypothesis testing problem

arXiv:2106.08849v2
Originality Incremental advance
AI Analysis

This work addresses the challenge of training reinforcement learning agents in partially observable environments, offering insights into memory design for improved optimization, though it is incremental as it focuses on a specific simplified problem.

The paper tackles the trade-off between memory architecture complexity and training performance in a POMDP, specifically a two-hypothesis testing problem, showing that a simple fixed memory can achieve exponentially small error rates similar to a complex random access memory, with empirical results indicating better training from random initialization for the constrained architecture.

Reinforcement learning is generally difficult for partially observable Markov decision processes (POMDPs), which occurs when the agent's observation is partial or noisy. To seek good performance in POMDPs, one strategy is to endow the agent with a finite memory, whose update is governed by the policy. However, policy optimization is non-convex in that case and can lead to poor training performance for random initialization. The performance can be empirically improved by constraining the memory architecture, then sacrificing optimality to facilitate training. Here we study this trade-off in a two-hypothesis testing problem, akin to the two-arm bandit problem. We compare two extreme cases: (i) the random access memory where any transitions between $M$ memory states are allowed and (ii) a fixed memory where the agent can access its last $m$ actions and rewards. For (i), the probability $q$ to play the worst arm is known to be exponentially small in $M$ for the optimal policy. Our main result is to show that similar performance can be reached for (ii) as well, despite the simplicity of the memory architecture: using a conjecture on Gray-ordered binary necklaces, we find policies for which $q$ is exponentially small in $2^m$, i.e. $q\simα^{2^m}$ with $α< 1$. In addition, we observe empirically that training from random initialization leads to very poor results for (i), and significantly better results for (ii) thanks to the constraints on the memory architecture.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes