LG · Apr 3

CoFi-PGMA: Counterfactual Policy Gradients under Filtered Feedback for Multi-Agent LLMs

arXiv:2604.22785 · 5.3

Predicted impact: top 77% in LG · last 90 days · Originality: Incremental advance

AI Analysis

This work proposes a principled framework for training multiple LLMs in multi-agent settings, targeting a bottleneck in deploying such systems: the routing or collaboration mechanism filters the feedback each agent learns from.

CoFi-PGMA addresses the problem of learning under filtered feedback in multi-agent LLM systems, where routing or collaboration distorts the learning signal. The method derives a counterfactual per-agent objective that corrects for selection-gated or shared rewards, and the authors demonstrate it on a real-world reasoning dataset.
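To make "selection-gated" feedback concrete, here is a minimal sketch of a standard inverse-propensity correction under softmax routing. It is not taken from the paper: the function names `softmax` and `ips_rewards`, the temperature parameter, and the toy numbers are all assumptions made for illustration, and the paper's own off-policy estimator may differ.

```python
import math

def softmax(scores, temperature=1.0):
    """Turn router scores into selection probabilities (hypothetical router)."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def ips_rewards(num_agents, chosen, reward, probs):
    """Per-agent reward estimate for one routed query.

    Only the chosen agent's response is evaluated, so every other agent
    gets 0. Dividing the observed reward by the selection probability
    makes each agent's estimate unbiased over the router's randomness.
    """
    estimates = [0.0] * num_agents
    estimates[chosen] = reward / probs[chosen]
    return estimates

# Illustrative values only: three agents with made-up router scores.
probs = softmax([2.0, 0.5, -1.0])
print(ips_rewards(3, chosen=1, reward=0.4, probs=probs))
```

Agents that the router rarely selects receive large, high-variance estimates on the few occasions they are chosen, which is why practical estimators often clip or normalize the weights.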

Large language model (LLM) deployments increasingly rely on multi-agent architectures in which multiple models either compete through routing mechanisms or collaborate to produce a final answer. In both settings, the learning signal received by each agent is filtered by the system mechanism. Routing produces selection-gated feedback where only the chosen response is evaluated, while collaboration produces shared rewards that obscure the individual contribution of each agent. As a result, standard RLHF objectives designed for a single deployed policy become misspecified. We introduce CoFi-PGMA (Counterfactual Policy Gradients under Filtered Feedback for Multi-Agent LLMs), a unified framework for learning under filtered feedback in multi-agent LLM systems. Our approach derives a counterfactual per-agent training objective based on marginal contribution, which corrects the learning signal under both routing and collaborative mechanisms. For routing systems, the objective corresponds to off-policy corrections for selection-gated feedback, while for collaborative systems it reduces to leave-one-out difference rewards for credit assignment. We further analyze how softmax routing induces risk-sensitive incentives, and we provide practical training algorithms that integrate counterfactual estimators, multi-turn-aware rewards, and policy optimization methods. We demonstrate the approach on a real-world reasoning dataset.
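The leave-one-out difference reward mentioned in the abstract can be sketched as follows. This is a generic illustration of the technique, not the authors' implementation: the helper `leave_one_out_rewards`, the `baseline` placeholder that stands in for a removed agent, and the toy `max`-based team reward are all assumptions.

```python
def leave_one_out_rewards(contributions, team_reward, baseline):
    """Difference reward for each agent in a collaborative system.

    D_i = R(all contributions) - R(contributions with agent i's entry
    replaced by `baseline`). An agent is credited only with the change
    in the shared reward that its own contribution causes.
    """
    full = team_reward(contributions)
    rewards = []
    for i in range(len(contributions)):
        counterfactual = list(contributions)
        counterfactual[i] = baseline  # hypothetical "agent i absent" stand-in
        rewards.append(full - team_reward(counterfactual))
    return rewards

# Toy team reward (an assumption): the team scores as well as its best member.
scores = [2, 9, 5]
print(leave_one_out_rewards(scores, team_reward=max, baseline=0))  # [0, 4, 0]
```

In this toy case only the second agent changes the shared outcome, so it alone receives credit. If two agents had contributed the same top score, each would receive zero, a known limitation of leave-one-out credit when contributions are redundant.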
