Completion vs Optimality: Policy Gradient in Long-Horizon Cumulative-Damage Problems
Identifies and explains failure modes of policy-gradient methods in a specific class of long-horizon decision problems with cumulative damage, providing testable predictions that replicate across two domains.
Policy-gradient methods fail in long-horizon cumulative-damage problems due to two orthogonal failure modes: completion and optimality. The proposed decomposition reveals that horizon access alone reduces the completion rate, while action-space restriction combined with horizon access achieves completion but leaves an optimality gap of ΔM_final = 0.271.
Long-horizon decision problems with cumulative damage couple locally attractive actions to globally adverse outcomes. We identify two orthogonal failure modes for policy-gradient methods on this class and propose a decomposition that separates them: \emph{completion} (reaching the terminal horizon rather than exiting via an implicit terminal constraint) and \emph{optimality} (matching the dynamic-programming reference given completion). Under PPO with a linear soft penalty, granting horizon access alone reduces the completion rate, because the penalty's equilibrium drives the dominant-activity share to zero. Combining action-space restriction with horizon access achieves completion but leaves an optimality gap ($\Delta M_{\text{final}} = 0.271$), which we trace to first-phase greedy commitment at the damage origin. We derive four testable predictions and evaluate them in two separately calibrated environments that share the same abstract structure but differ in domain, horizon, activity set, and calibration data: a 49-step bricklayer career and a 20-season NBA power-forward career. All four predictions replicate qualitatively. The horizon-invariance prediction is met at three of four tested horizons, with the exception at $H = 15$ consistent with the $H^*$ boundary ($H^* \in [6, 14]$ under the NBA parameters).
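As an illustration of the completion/optimality decomposition, the following minimal sketch computes both quantities from a batch of rollouts. It is not the authors' evaluation code. The `Episode` record, the `decompose` function, the field names, and the sign convention (reference minus policy mean) are all assumptions introduced here; the sketch also assumes that the gap is measured on the terminal value of $M$ over completed episodes only, which is one reading of "optimality given completion".

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Episode:
    # Hypothetical rollout record; field names are illustrative only.
    steps_survived: int  # steps taken before early exit or reaching the horizon
    final_m: float       # terminal value of the tracked quantity M


def decompose(
    episodes: Sequence[Episode], horizon: int, m_ref: float
) -> Tuple[float, Optional[float]]:
    """Return (completion_rate, optimality_gap).

    completion_rate: fraction of episodes that reach the terminal horizon.
    optimality_gap:  m_ref minus the mean final M over completed episodes
                     (assumed sign convention), or None if none complete.
    """
    if not episodes:
        raise ValueError("need at least one episode")

    # Completion: the episode reached the horizon instead of exiting early.
    completed = [e for e in episodes if e.steps_survived >= horizon]
    completion_rate = len(completed) / len(episodes)

    # Optimality is only defined conditional on completion.
    if not completed:
        return completion_rate, None

    mean_final = sum(e.final_m for e in completed) / len(completed)
    return completion_rate, m_ref - mean_final
```

Here `m_ref` stands for the terminal value attained by the dynamic-programming reference policy, supplied by the caller. Keeping the two numbers separate means a policy that rarely completes is not credited or penalized on optimality for episodes that never reach the horizon.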