LG · Apr 9

Wireless Communication Enhanced Value Decomposition for Multi-Agent Reinforcement Learning

arXiv:2604.08728 · 12.8 · h-index: 9

Predicted impact: top 89% in LG · last 90 days
Originality: Incremental advance

AI Analysis

For multi-agent RL practitioners, this work addresses the gap between idealized communication assumptions and realistic wireless channels, offering a principled way to incorporate communication topology into value decomposition.

CLOVER introduces a value decomposition method for multi-agent reinforcement learning that conditions on the communication graph realized under a realistic wireless channel. On the Predator-Prey and Lumberjacks benchmarks, it improves convergence speed and final performance over VDN, QMIX, and their TarMAC-augmented variants.

Cooperation in multi-agent reinforcement learning (MARL) benefits from inter-agent communication, yet most approaches assume idealized channels and existing value decomposition methods ignore who successfully shared information with whom. We propose CLOVER, a cooperative MARL framework whose centralized value mixer is conditioned on the communication graph realized under a realistic wireless channel. This graph introduces a relational inductive bias into value decomposition, constraining how individual utilities are mixed based on the realized communication structure. The mixer is a GNN with node-specific weights generated by a Permutation-Equivariant Hypernetwork: multi-hop propagation along communication edges reshapes credit assignment so that different topologies induce different mixing. We prove this mixer is permutation invariant, monotonic (preserving the Individual-Global-Max (IGM) condition), and strictly more expressive than QMIX-style mixers. To handle realistic channels, we formulate an augmented MDP isolating stochastic channel effects from the agent computation graph, and employ a stochastic receptive field encoder for variable-size message sets, enabling end-to-end differentiable training. On Predator-Prey and Lumberjacks benchmarks under p-persistent CSMA (p-CSMA) wireless channels, CLOVER consistently improves convergence speed and final performance over VDN, QMIX, TarMAC+VDN, and TarMAC+QMIX. Behavioral analysis confirms agents learn adaptive signaling and listening strategies, and ablations isolate the communication-graph inductive bias as the key source of improvement.
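To make the idea of a graph-conditioned monotonic mixer concrete, here is a minimal Python sketch under loudly stated assumptions; it is not the CLOVER implementation. The channel is replaced by a single-slot random-access model (each agent transmits with probability p, a listener decodes a sender only when exactly one in-range agent transmits, and transmitters cannot listen), which is closer to slotted ALOHA than to the paper's p-CSMA. The mixer is a one-layer linear stand-in for the paper's GNN: a shared, and therefore permutation-equivariant, hypothetical hypernetwork maps each agent's feature vector to a non-negative weight via abs() in the QMIX style, and a few hops of non-negative propagation along received-message edges give senders extra credit. All names (sample_comm_graph, node_weights, mix, theta, bias, alpha, hops, comm_range) are hypothetical.

```python
import math
import random
from typing import List, Sequence, Set, Tuple

Edge = Tuple[int, int]  # (sender, receiver): receiver decoded sender's message


def sample_comm_graph(positions: Sequence[Tuple[float, float]], comm_range: float,
                      p: float, rng: random.Random) -> Set[Edge]:
    """Hypothetical single-slot random-access channel (NOT the paper's p-CSMA model).

    Each agent transmits with probability p. A listening agent j decodes sender i
    only if i is within comm_range of j and no other transmitter is in range of j
    (otherwise the messages collide). Transmitters cannot receive (half-duplex).
    """
    n = len(positions)
    transmitting = [rng.random() < p for _ in range(n)]
    edges: Set[Edge] = set()
    for j in range(n):
        if transmitting[j]:
            continue  # half-duplex: an agent cannot listen while it is sending
        heard = [i for i in range(n)
                 if transmitting[i] and math.dist(positions[i], positions[j]) <= comm_range]
        if len(heard) == 1:  # exactly one audible transmitter: message decoded
            edges.add((heard[0], j))
    return edges


def node_weights(features: Sequence[Sequence[float]], theta: Sequence[float],
                 bias: float) -> List[float]:
    """Assumed hypernetwork: the same parameters are applied to every node, so
    permuting agents permutes the weights (equivariance). abs() keeps each weight
    non-negative, which is what makes the mixer monotonic in every utility."""
    return [abs(sum(t * f for t, f in zip(theta, feat)) + bias) for feat in features]


def mix(utilities: Sequence[float], features: Sequence[Sequence[float]],
        edges: Set[Edge], theta: Sequence[float], bias: float,
        alpha: float = 0.5, hops: int = 2) -> float:
    """Illustrative graph-conditioned monotonic mixer (hypothetical, linear)."""
    assert alpha >= 0.0, "alpha must be non-negative to preserve monotonicity"
    w = node_weights(features, theta, bias)
    x = [wi * qi for wi, qi in zip(w, utilities)]
    for _ in range(hops):
        # Each receiver adds a non-negative share of what its senders currently hold,
        # so agents whose messages got through accumulate extra credit.
        nxt = list(x)
        for sender, receiver in edges:
            nxt[receiver] += alpha * x[sender]
        x = nxt
    return sum(x)  # summing over nodes makes the result permutation invariant


if __name__ == "__main__":
    rng = random.Random(0)
    positions = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
    edges = sample_comm_graph(positions, comm_range=1.5, p=0.5, rng=rng)
    q = [1.0, 2.0, 0.5]
    feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(edges, mix(q, feats, edges, theta=[0.3, -0.2], bias=0.1))
```

Because every coefficient is non-negative, raising any agent's utility can never lower the total, which is the sufficient condition QMIX-style mixers use to keep per-agent greedy actions consistent with the joint argmax. Relabeling agents together with their edges leaves the sum unchanged, and the same utilities mix differently on an empty graph than on a chain of received messages, which is the topology dependence the abstract describes.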

