LG Apr 20

Efficient Federated RLHF via Zeroth-Order Policy Optimization

arXiv:2604.17747 42.9 h-index: 4
AI Analysis

For federated learning practitioners, this work provides a more efficient RLHF method for edge devices, though the gains are incremental over existing approaches.

This paper addresses the challenge of reinforcement learning from human feedback (RLHF) in federated settings with resource-constrained agents. The proposed algorithm, Par-S$^2$ZPO, achieves lower communication, computation, and memory costs while outperforming a FedAvg-based RLHF baseline on four MuJoCo tasks.

This paper considers reinforcement learning from human feedback in a federated learning setting with resource-constrained agents, such as edge devices. We propose an efficient federated RLHF algorithm, named Partitioned, Sign-based Stochastic Zeroth-order Policy Optimization (Par-S$^2$ZPO). The algorithm is built on zeroth-order optimization with binary perturbation, resulting in low communication, computation, and memory complexity by design. Our theoretical analysis establishes an upper bound on the convergence rate of Par-S$^2$ZPO, revealing that it is as efficient as its centralized counterpart in terms of sample complexity but converges faster in terms of policy update iterations. Our experimental results show that it outperforms a FedAvg-based RLHF baseline on four MuJoCo RL tasks.
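To make the idea of sign-based zeroth-order optimization with binary perturbation concrete, here is a minimal toy sketch. It is not the paper's Par-S$^2$ZPO algorithm: the quadratic `toy_return`, the noisy `preference_bit` oracle, the majority vote, and all hyperparameters are assumptions made for illustration, and the paper's partitioning scheme is not modelled because the abstract does not describe it. The sketch only shows how a policy parameter vector might be improved when each agent reports a single preference bit per round and no gradients are computed.

```python
import random


def toy_return(theta, target):
    """Stand-in for a policy's expected return (hypothetical, not a MuJoCo task)."""
    return -sum((t - g) ** 2 for t, g in zip(theta, target))


def preference_bit(return_a, return_b, rng, noise=0.1):
    """Hypothetical noisy preference oracle: +1 if A is preferred over B, else -1."""
    noisy_a = return_a + rng.gauss(0.0, noise)
    noisy_b = return_b + rng.gauss(0.0, noise)
    return 1 if noisy_a > noisy_b else -1


def federated_round(theta, target, n_agents, mu, lr, rng):
    """One round: shared binary perturbation, one-bit votes, sign-based update."""
    # Binary perturbation: every coordinate is +1 or -1.
    delta = [rng.choice((-1.0, 1.0)) for _ in theta]
    perturbed = [t + mu * d for t, d in zip(theta, delta)]
    # Each agent compares the perturbed policy with the current one
    # and uploads a single bit (assumed protocol, for illustration only).
    votes = [
        preference_bit(toy_return(perturbed, target), toy_return(theta, target), rng)
        for _ in range(n_agents)
    ]
    # Majority vote gives the sign of the step; n_agents is odd, so no ties.
    direction = 1.0 if sum(votes) > 0 else -1.0
    return [t + lr * direction * d for t, d in zip(theta, delta)]


def distance(a, b):
    """Euclidean distance between two parameter vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def run(steps=2000, dim=10, n_agents=5, mu=0.05, lr=0.01, seed=0):
    """Run the toy loop and return the final parameters and the target."""
    rng = random.Random(seed)
    target = [1.0] * dim
    theta = [0.0] * dim
    for _ in range(steps):
        theta = federated_round(theta, target, n_agents, mu, lr, rng)
    return theta, target


if __name__ == "__main__":
    theta, target = run()
    print("initial distance:", distance([0.0] * len(target), target))
    print("final distance:  ", distance(theta, target))
```

In this sketch each agent uploads one bit per round and only forward evaluations are needed, which is the kind of saving the abstract attributes to the zeroth-order, sign-based design. How closely this matches the paper's actual update rule is an open assumption.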

