Beyond Shallow Behavior: Task-Efficient Value-Based Multi-Task Offline MARL via Skill Discovery
This work addresses the problem of inefficient task-specific offline MARL methods for domains with high interaction costs and risks, providing a solution for researchers and practitioners working in multi-task reinforcement learning.
The authors tackled the problem of task-efficient value-based multi-task offline MARL, achieving superior generalization performance with up to 68.9% improvement on individual task sets, and outperforming other methods on 13 out of 14 task sets. Their approach eliminates the need for retraining for new tasks, reducing redundancy and inefficiency.
As a data-driven approach, offline MARL learns superior policies solely from offline datasets, ideal for domains rich in historical data but with high interaction costs and risks. However, most existing methods are task-specific, requiring retraining for new tasks, leading to redundancy and inefficiency. To address this issue, we propose a task-efficient value-based multi-task offline MARL algorithm, Skill-Discovery Conservative Q-Learning (SD-CQL). Unlike existing methods decoding actions from skills via behavior cloning, SD-CQL discovers skills in a latent space by reconstructing the next observation, evaluates fixed and variable actions separately, and uses conservative Q-learning with local value calibration to select the optimal action for each skill. It eliminates the need for local-global alignment and enables strong multi-task generalization from limited, small-scale source tasks. Substantial experiments on StarCraft II demonstrate the superior generalization performance and task-efficiency of SD-CQL. It achieves the best performance on $\textbf{13}$ out of $14$ task sets, with up to $\textbf{68.9%}$ improvement on individual task sets.