LG AINov 24, 2025

MAESTRO: Multi-Agent Environment Shaping through Task and Reward Optimization

arXiv:2511.19253v2

Originality Incremental advance

AI Analysis

This addresses bottlenecks in MARL for applications like traffic signal control, offering an incremental improvement by integrating LLMs into training without increasing deployment costs.

The paper tackles the problem of designing dense reward functions and curricula in cooperative Multi-Agent Reinforcement Learning (MARL) by proposing MAESTRO, a framework that uses Large Language Models (LLMs) as offline training architects to generate curricula and reward functions, resulting in a +4.0% higher mean return and 2.2% better risk-adjusted performance over a baseline.

Cooperative Multi-Agent Reinforcement Learning (MARL) faces two major design bottlenecks: crafting dense reward functions and constructing curricula that avoid local optima in high-dimensional, non-stationary environments. Existing approaches rely on fixed heuristics or use Large Language Models (LLMs) directly in the control loop, which is costly and unsuitable for real-time systems. We propose MAESTRO (Multi-Agent Environment Shaping through Task and Reward Optimization), a framework that moves the LLM outside the execution loop and uses it as an offline training architect. MAESTRO introduces two generative components: (i) a semantic curriculum generator that creates diverse, performance-driven traffic scenarios, and (ii) an automated reward synthesizer that produces executable Python reward functions adapted to evolving curriculum difficulty. These components guide a standard MARL backbone (MADDPG) without increasing inference cost at deployment. We evaluate MAESTRO on large-scale traffic signal control (Hangzhou, 16 intersections) and conduct controlled ablations. Results show that combining LLM-generated curricula with LLM-generated reward shaping yields improved performance and stability. Across four seeds, the full system achieves +4.0% higher mean return (163.26 vs. 156.93) and 2.2% better risk-adjusted performance (Sharpe 1.53 vs. 0.70) over a strong curriculum baseline. These findings highlight LLMs as effective high-level designers for cooperative MARL training.

View on arXiv PDF

Similar