AI · Apr 4

When Adaptive Rewards Hurt: Causal Probing and the Switching-Stability Dilemma in LLM-Guided LEO Satellite Scheduling

arXiv:2604.03562 14.4
AI Analysis

This work studies reward design for DRL-based multi-beam LEO satellite scheduling in communication systems and shows that adaptive rewards can hurt performance, an incremental finding with practical implications for LLM-DRL integration.

The study tested adaptive reward design for deep reinforcement learning in LEO satellite scheduling and found that static reward weights (342.1 Mbps) outperformed dynamic ones (103.3 ± 96.8 Mbps) because of a switching-stability dilemma: weight adaptation repeatedly disrupts convergence. Causal probing showed that a 20% increase in the switching penalty raised performance by 157 Mbps in the polar handover regime and by 130 Mbps in the hot-cold regime. The fine-tuned LLM collapsed to 45.3 ± 43.0 Mbps because of weight oscillation.

Adaptive reward design for deep reinforcement learning (DRL) in multi-beam LEO satellite scheduling is motivated by the intuition that regime-aware reward weights should outperform static ones. We systematically test this intuition and uncover a switching-stability dilemma: near-constant reward weights (342.1 Mbps) outperform carefully tuned dynamic weights (103.3 ± 96.8 Mbps) because PPO requires a quasi-stationary reward signal for value function convergence. Weight adaptation, regardless of quality, degrades performance by repeatedly restarting convergence. To understand why specific weights matter, we introduce a single-variable causal probing method that independently perturbs each reward term by ±20% and measures the PPO response after 50k steps. Probing reveals counterintuitive leverage: a 20% increase in the switching penalty yields +157 Mbps for the polar handover regime and +130 Mbps for the hot-cold regime, findings inaccessible to human experts or trained MLPs without systematic probing. We evaluate four MDP architect variants (fixed, rule-based, learned MLP, fine-tuned LLM) across known and novel traffic regimes. The MLP achieves 357.9 Mbps on known regimes and 325.2 Mbps on novel regimes, while the fine-tuned LLM collapses to 45.3 ± 43.0 Mbps because of weight oscillation rather than a lack of domain knowledge: output consistency, not knowledge, is the binding constraint. Our findings provide an empirically grounded roadmap for LLM-DRL integration in communication systems, identifying where LLMs add irreplaceable value (natural language intent understanding) and where simpler methods suffice.
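The single-variable probing loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function `probe_reward_weights`, the reward-term names, and the `toy_evaluate` stand-in are all assumptions made here. In the paper's setting, the evaluation step would train PPO for 50k steps under the perturbed weights and report the resulting rate in Mbps; the toy function below only makes the sketch runnable, and its numbers are not results from the paper.

```python
from typing import Callable, Dict


def probe_reward_weights(
    base_weights: Dict[str, float],
    evaluate: Callable[[Dict[str, float]], float],
    delta: float = 0.20,
) -> Dict[str, Dict[str, float]]:
    """Perturb one reward weight at a time by +/-delta (relative) and
    record the change in the evaluation score against the baseline."""
    baseline = evaluate(base_weights)
    effects: Dict[str, Dict[str, float]] = {}
    for name, value in base_weights.items():
        row: Dict[str, float] = {}
        for label, factor in (("plus", 1.0 + delta), ("minus", 1.0 - delta)):
            # Copy the baseline so only the probed term differs.
            perturbed = dict(base_weights)
            perturbed[name] = value * factor
            row[label] = evaluate(perturbed) - baseline
        effects[name] = row
    return effects


def toy_evaluate(weights: Dict[str, float]) -> float:
    """Hypothetical stand-in for 'train PPO for 50k steps, return Mbps'.
    The linear form and coefficients are invented for illustration."""
    return 300.0 + 100.0 * weights["switching_penalty"] - 20.0 * weights["throughput"]


if __name__ == "__main__":
    # Hypothetical reward terms and baseline weights.
    weights = {"throughput": 1.0, "switching_penalty": 0.5}
    for term, row in probe_reward_weights(weights, toy_evaluate).items():
        print(f"{term}: +20% -> {row['plus']:+.1f}, -20% -> {row['minus']:+.1f}")
```

Because only one term changes per run, any difference from the baseline can be attributed to that term alone, which is what lets the probe surface effects such as the switching-penalty leverage reported above. With a real (stochastic) PPO evaluator, each probe would likely need several seeds before the differences could be trusted.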
