Random Policy Evaluation Uncovers Policies of Generative Flow Networks
This work bridges a gap between GFlowNets and standard RL, which could simplify GFlowNet implementation and improve diverse solution discovery for researchers in probabilistic modeling and sequential decision-making, although the contribution appears incremental in that it connects existing frameworks.
The paper connects Generative Flow Networks (GFlowNets) to standard reinforcement learning (RL) through policy evaluation: it shows that, under certain structural conditions, the value function obtained by evaluating a uniform policy is closely related to the GFlowNet flow function. It introduces a rectified random policy evaluation (RPE) algorithm that achieves the same reward-matching effect as GFlowNets in these cases and obtains competitive results on benchmarks.
The Generative Flow Network (GFlowNet) is a probabilistic framework in which an agent learns a stochastic policy and flow functions to sample objects with probability proportional to an unnormalized reward function. A number of recent works have explored connections between GFlowNets and maximum entropy (MaxEnt) RL, which modifies the standard objective of RL agents by optimizing an entropy-regularized objective. However, the relationship between GFlowNets and standard RL remains largely unexplored, despite the inherent similarities in their sequential decision-making nature. While GFlowNets can discover diverse solutions through specialized flow-matching objectives, connecting the two frameworks can simplify GFlowNet implementation through established RL principles and improve RL's ability to discover diverse solutions. In this paper, we bridge this gap by revealing a fundamental connection between GFlowNets and one of RL's most basic components: policy evaluation. Surprisingly, we find that, under certain structural conditions, the value function obtained from evaluating a uniform policy is closely associated with the flow functions in GFlowNets through the lens of flow iteration. Building upon these insights, we introduce a rectified random policy evaluation (RPE) algorithm, which achieves the same reward-matching effect as GFlowNets in these cases simply by evaluating a fixed random policy, offering a new perspective. Empirical results across extensive benchmarks demonstrate that RPE achieves competitive results compared to previous approaches, shedding light on the previously overlooked connection between (non-MaxEnt) RL and GFlowNets.
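To make the link between uniform-policy evaluation and flows concrete, the following is a minimal sketch on a toy tree-structured state space. It is an illustrative construction under stated assumptions, not the paper's RPE algorithm: the state names, the rewards, and the helper functions (`flow`, `branching_products`, `uniform_value`, `forward_policy`) are all hypothetical, and the rectification used here (scaling each terminal reward by the product of branching factors along its path) is an assumption chosen so that the identity holds exactly on a tree.

```python
# Toy tree (hypothetical): non-terminal states map to their children,
# terminal states carry an unnormalized reward R(x).
children = {
    "s0": ["a", "b", "c"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "c": ["c1", "c2"],
}
reward = {"a1": 1.0, "a2": 3.0, "b1": 2.0, "c1": 4.0, "c2": 6.0}


def flow(s):
    """State flow on a tree: F(x) = R(x) at terminals, else sum of child flows."""
    if s in reward:
        return reward[s]
    return sum(flow(c) for c in children[s])


def branching_products(root):
    """G(s): product of the branching factors of all strict ancestors of s."""
    g = {root: 1.0}
    stack = [root]
    while stack:
        s = stack.pop()
        for c in children.get(s, []):
            g[c] = g[s] * len(children[s])
            stack.append(c)
    return g


def uniform_value(s, g):
    """Value of the uniform policy with rectified terminal reward R(x) * G(x).

    The rectification is an assumption of this sketch: it cancels the
    1/|A(s)| factors that averaging over uniformly chosen actions introduces.
    """
    if s in reward:
        return reward[s] * g[s]
    kids = children[s]
    return sum(uniform_value(c, g) for c in kids) / len(kids)


def forward_policy(s, g):
    """Sampling policy derived from values: P(c | s) = V(c) / (|A(s)| * V(s))."""
    kids = children[s]
    v_s = uniform_value(s, g)
    return {c: uniform_value(c, g) / (len(kids) * v_s) for c in kids}


g = branching_products("s0")
Z = sum(reward.values())

# On this tree, V(s) = G(s) * F(s) for every state, so V(s0) = Z.
value_matches_flow = all(
    abs(uniform_value(s, g) - g[s] * flow(s)) < 1e-9 for s in g
)

# Probability of reaching terminal a1 under the derived policy.
p_a1 = forward_policy("s0", g)["a"] * forward_policy("a", g)["a1"]
```

On this toy tree the root value under the uniform policy equals the total reward Z, and the derived policy reaches each terminal state x with probability R(x)/Z, which is the reward-matching property GFlowNets aim for. The sketch covers only trees, where every state has a single parent; it does not address general DAGs.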