AI, May 9

When (and How) to Trust the Expert: Diagnosing Query-Time Expert-Guided Reinforcement Learning

Yann Berthelot, Philippe Preux, Riad Akrour

arXiv:2605.09109

AI Analysis

For researchers deploying RL with suboptimal controllers, this work provides a systematic comparison and actionable guidelines to avoid common pitfalls.

The paper benchmarks query-time expert-guided RL methods on continuous-control tasks with imperfect experts, revealing three failure modes missed in prior work: critic blind spots, residual saturation, and buffer poisoning. No single method dominates; the authors provide a decision rule based on pre-training observables.

Many continuous-control problems ship with a competent but suboptimal controller (a tuned PID, a hand-designed gait). A growing family of methods uses such controllers as queryable experts during RL, but each method has been proposed in isolation, on a different benchmark, without imperfect-expert testing. We harmonize the comparison on a shared SAC backbone, common HPO and evaluation protocols, 100/50 seeds per (env, method), and a degradation sweep over expert undertuning, action bias, and observation noise. The comparison surfaces three failure modes single-paper evaluations miss: (F1) a critic blind spot under argmax-plus-bootstrap that drags IBRL below no-expert SAC on experts close to the no-expert-RL ceiling (RL-near-ceiling, distinct from the absolute physical ceiling); (F2) residual saturation on far-from-optimal experts; and (F3) warm-start buffer poisoning that collapses training-time-handoff methods under deployment-time expert undertuning. No single method dominates: each wins on one task-structure regime and fails predictably elsewhere; on RL-near-ceiling experts (FourTank, GlassFurnace) no query-time method clears the expert within our 1M-step budget, leaving open whether this is a fundamental wall or a budget effect. We convert the spread into a testable decision rule keyed on three pre-training observables (expert quality, task termination, perturbation type). The benchmark, taxonomy, and decision rule are the primary contribution; we additionally describe EDGE, a softmax-over-ensemble-LCB design point used to demonstrate that both axes the taxonomy points to (gate form, scoring rule) are individually exploitable.
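The two axes named at the end of the abstract (gate form and scoring rule) can be made concrete with a toy example. The sketch below is only an illustration under stated assumptions, not the authors' IBRL or EDGE implementation: the abstract gives no formulas, so the function names, the `beta` and `temperature` parameters, the use of the ensemble mean for the hard gate, and the LCB definition (mean minus `beta` times the standard deviation) are all hypothetical choices made here.

```python
import math
import random
from statistics import mean, pstdev


def lcb(q_values, beta=1.0):
    """Lower confidence bound of an ensemble's Q estimates (assumed form: mean - beta * std)."""
    return mean(q_values) - beta * pstdev(q_values)


def argmax_gate(q_expert, q_policy):
    """Hard gate: choose the candidate action with the higher mean Q estimate."""
    return "expert" if mean(q_expert) >= mean(q_policy) else "policy"


def softmax_lcb_probs(q_expert, q_policy, beta=1.0, temperature=1.0):
    """Soft gate: softmax over the LCB scores of the two candidate actions."""
    scores = [lcb(q_expert, beta) / temperature, lcb(q_policy, beta) / temperature]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return {"expert": exps[0] / total, "policy": exps[1] / total}


def softmax_lcb_gate(q_expert, q_policy, beta=1.0, temperature=1.0, rng=random):
    """Sample which action to execute from the softmax-over-LCB distribution."""
    probs = softmax_lcb_probs(q_expert, q_policy, beta, temperature)
    return "expert" if rng.random() < probs["expert"] else "policy"


# Hypothetical ensemble estimates: the critics agree on the expert action
# but disagree strongly on the learned policy's action.
q_expert = [10.0, 10.2, 9.8]
q_policy = [14.0, 6.0, 11.5]

hard_choice = argmax_gate(q_expert, q_policy)
soft_probs = softmax_lcb_probs(q_expert, q_policy)
```

In this made-up case the hard gate follows the policy because its mean estimate is slightly higher, while the LCB-scored soft gate puts most of its probability on the expert because the ensemble disagrees about the policy's action. Whether this matches the behavior the paper reports is not something the abstract settles.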
