When Adaptive Rewards Hurt: Causal Probing and the Switching-Stability Dilemma in LLM-Guided LEO Satellite Scheduling
This work addresses reward design for multi-beam LEO satellite scheduling in communication systems and shows that adaptive rewards can hurt performance. The finding is incremental but has practical implications for LLM-DRL integration.
The study tests adaptive reward design for deep reinforcement learning in LEO satellite scheduling and finds that near-constant reward weights (342.1 Mbps) outperform dynamic ones (103.3 ± 96.8 Mbps). The cause is a switching-stability dilemma: weight adaptation disrupts convergence. Causal probing shows that a +20% increase in the switching penalty improves performance by 157 Mbps in the polar handover regime and by 130 Mbps in the hot-cold regime. The fine-tuned LLM collapses to 45.3 ± 43.0 Mbps because of weight oscillation.
Adaptive reward design for deep reinforcement learning (DRL) in multi-beam LEO satellite scheduling is motivated by the intuition that regime-aware reward weights should outperform static ones. We systematically test this intuition and uncover a switching-stability dilemma: near-constant reward weights (342.1 Mbps) outperform carefully tuned dynamic weights (103.3 ± 96.8 Mbps) because PPO requires a quasi-stationary reward signal for value function convergence. Weight adaptation, regardless of quality, degrades performance by repeatedly restarting convergence. To understand why specific weights matter, we introduce a single-variable causal probing method that independently perturbs each reward term by ±20% and measures the PPO response after 50k steps. Probing reveals counterintuitive leverage: a +20% increase in the switching penalty yields +157 Mbps for the polar handover regime and +130 Mbps for the hot-cold regime. These findings are inaccessible to human experts or trained MLPs without systematic probing. We evaluate four MDP architect variants (fixed, rule-based, learned MLP, fine-tuned LLM) across known and novel traffic regimes. The MLP achieves 357.9 Mbps on known regimes and 325.2 Mbps on novel regimes, while the fine-tuned LLM collapses to 45.3 ± 43.0 Mbps due to weight oscillation rather than lack of domain knowledge: output consistency, not knowledge, is the binding constraint. Our findings provide an empirically grounded roadmap for LLM-DRL integration in communication systems, identifying where LLMs add irreplaceable value (natural language intent understanding) and where simpler methods suffice.
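The single-variable probing procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function `probe_reward_weights`, the weight names, and the `toy_evaluate` stand-in are all hypothetical. In the paper's setting, the evaluation step would train PPO for 50k steps under the perturbed weights and report throughput in Mbps. Here a synthetic linear function takes its place so the example runs on its own, and its numbers carry no relation to the reported results.

```python
from typing import Callable, Dict

Weights = Dict[str, float]


def probe_reward_weights(
    base: Weights,
    evaluate: Callable[[Weights], float],
    delta: float = 0.20,
) -> Dict[str, Dict[str, float]]:
    """Perturb one reward weight at a time by +/-delta (relative) and
    record the change in the evaluation metric against the baseline."""
    baseline = evaluate(base)
    effects: Dict[str, Dict[str, float]] = {}
    for name, value in base.items():
        row: Dict[str, float] = {}
        for label, factor in (("plus", 1.0 + delta), ("minus", 1.0 - delta)):
            perturbed = dict(base)             # copy so other weights stay fixed
            perturbed[name] = value * factor   # change exactly one reward term
            row[label] = evaluate(perturbed) - baseline
        effects[name] = row
    return effects


def toy_evaluate(weights: Weights) -> float:
    # ASSUMPTION: synthetic stand-in for "train PPO for 50k steps and
    # measure throughput"; the coefficients are made up for illustration.
    return 100.0 + 50.0 * weights["switching_penalty"] - 10.0 * weights["throughput"]


# Hypothetical reward terms and baseline weights.
base_weights = {"throughput": 1.0, "switching_penalty": 0.5, "fairness": 0.2}
effects = probe_reward_weights(base_weights, toy_evaluate)

for term, row in effects.items():
    print(f"{term}: +20% -> {row['plus']:+.2f}, -20% -> {row['minus']:+.2f}")
```

Because only one weight changes per run, any difference from the baseline can be attributed to that term alone. With a real PPO evaluation, each probe would need several seeds to separate the effect from training variance.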