LG, GT, MA, OC | Sep 26, 2025

Learning from Delayed Feedback in Games via Extra Prediction

arXiv:2509.22426v2 | 1 citation | h-index: 11
Originality: Incremental advance
AI Analysis

This work addresses a bottleneck in multi-agent learning in games, namely delayed reward feedback, and offers an incremental improvement over existing optimistic methods.

The study tackles time-delayed feedback in multi-agent learning in games. It shows that even a single-step delay worsens the social regret and convergence of Optimistic Follow-the-Regularized-Leader (OFTRL), and it proposes Weighted Optimistic Follow-the-Regularized-Leader (WOFTRL), which recovers constant social regret and convergence to the Nash equilibrium when the optimistic weight exceeds the delay.

This study raises and addresses the problem of time-delayed feedback in learning in games. Because learning in games assumes that multiple agents learn their strategies independently, a discrepancy in optimization often emerges among the agents. To overcome this discrepancy, algorithms incorporate a prediction of the future reward, as in Optimistic Follow-the-Regularized-Leader (OFTRL). However, a time delay in observing past rewards hinders this prediction. Indeed, this study first proves that even a single-step delay worsens the performance of OFTRL in terms of both social regret and convergence. This study then proposes weighted OFTRL (WOFTRL), in which the prediction vector of the next reward in OFTRL is weighted $n$ times. The intuition is that the optimistic weight cancels out the time delay. We prove that when the optimistic weight exceeds the time delay, WOFTRL recovers good performance: social regret is constant in general-sum normal-form games, and the strategies converge in the last iterate to the Nash equilibrium in poly-matrix zero-sum games. The theoretical results are supported and strengthened by our experiments.
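The weighting idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm or experimental setup: the exact update rule is not given here, so the code assumes an entropy-regularized FTRL step (a softmax over scores), a two-player zero-sum matrix game, and a delay model in which the reward from round `t - 1 - delay` is the latest one available at round `t`. The names `woftrl_strategy`, `simulate`, `weight`, `delay`, and `eta` are hypothetical and chosen for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def woftrl_strategy(cumulative, prediction, weight, eta):
    """Assumed update: softmax of eta * (cumulative reward + weight * prediction).

    With weight=1 this matches the usual optimistic step; weight=0 is plain FTRL.
    """
    scores = [eta * (c + weight * p) for c, p in zip(cumulative, prediction)]
    return softmax(scores)


def simulate(payoff, delay, weight, eta=0.1, steps=500):
    """Self-play in a zero-sum matrix game where rewards arrive `delay` rounds late."""
    rows, cols = len(payoff), len(payoff[0])
    history_x, history_y = [], []            # all reward vectors generated so far
    cum_x, cum_y = [0.0] * rows, [0.0] * cols  # sums of rewards that have arrived
    last_x, last_y = [0.0] * rows, [0.0] * cols  # latest arrived reward (the prediction)
    x, y = [1.0 / rows] * rows, [1.0 / cols] * cols
    for t in range(steps):
        arrived = t - 1 - delay  # round whose reward becomes visible now
        if arrived >= 0:
            last_x, last_y = history_x[arrived], history_y[arrived]
            cum_x = [c + r for c, r in zip(cum_x, last_x)]
            cum_y = [c + r for c, r in zip(cum_y, last_y)]
        x = woftrl_strategy(cum_x, last_x, weight, eta)
        y = woftrl_strategy(cum_y, last_y, weight, eta)
        # Row player receives payoff * y; column player receives -(payoff^T * x).
        history_x.append([sum(payoff[i][j] * y[j] for j in range(cols)) for i in range(rows)])
        history_y.append([-sum(payoff[i][j] * x[i] for i in range(rows)) for j in range(cols)])
    return x, y


if __name__ == "__main__":
    game = [[2.0, -1.0], [-1.0, 1.0]]  # illustrative payoff matrix, not from the paper
    for weight in (1, 2):
        x, y = simulate(game, delay=1, weight=weight)
        print(f"delay=1 weight={weight} x={x} y={y}")
```

Running the script prints the final strategies for a one-step delay under two weights, which gives a rough way to compare the standard optimistic step (`weight=1`) with a weight larger than the delay (`weight=2`). It makes no claim to reproduce the paper's guarantees or experiments.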

