AI · May 9

When (and How) to Trust the Expert: Diagnosing Query-Time Expert-Guided Reinforcement Learning

arXiv:2605.09109 · 10.7
AI Analysis

For researchers deploying RL with suboptimal controllers, this work provides a systematic comparison and actionable guidelines to avoid common pitfalls.

The paper benchmarks query-time expert-guided RL methods on continuous-control tasks with imperfect experts, revealing three failure modes missed in prior work: critic blind spots, residual saturation, and buffer poisoning. No single method dominates; the authors provide a decision rule based on pre-training observables.

Many continuous-control problems ship with a competent but suboptimal controller (a tuned PID, a hand-designed gait). A growing family of methods uses such controllers as queryable experts during RL, but each method has been proposed in isolation, on a different benchmark, without imperfect-expert testing. We harmonize the comparison on a shared SAC backbone, common HPO and evaluation protocols, 100/50 seeds per (env, method), and a degradation sweep over expert undertuning, action bias, and observation noise. The comparison surfaces three failure modes single-paper evaluations miss: (F1) a critic blind spot under argmax-plus-bootstrap that drags IBRL below no-expert SAC on experts close to the no-expert-RL ceiling (RL-near-ceiling, distinct from the absolute physical ceiling); (F2) residual saturation on far-from-optimal experts; and (F3) warm-start buffer poisoning that collapses training-time-handoff methods under deployment-time expert undertuning. No single method dominates: each wins on one task-structure regime and fails predictably elsewhere; on RL-near-ceiling experts (FourTank, GlassFurnace) no query-time method clears the expert within our 1M-step budget, leaving open whether this is a fundamental wall or a budget effect. We convert the spread into a testable decision rule keyed on three pre-training observables (expert quality, task termination, perturbation type). The benchmark, taxonomy, and decision rule are the primary contribution; we additionally describe EDGE, a softmax-over-ensemble-LCB design point used to demonstrate that both axes the taxonomy points to (gate form, scoring rule) are individually exploitable.
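The two axes the abstract names for EDGE (gate form and scoring rule) can be illustrated with a toy comparison. The sketch below is not the authors' implementation. It assumes a critic ensemble that returns one Q-estimate per member for each candidate action, and it contrasts a hard argmax gate on the ensemble mean with a softmax gate on an ensemble lower confidence bound (LCB). The function names, the `kappa` and `temperature` parameters, and the example numbers are all hypothetical.

```python
import math
import random
from statistics import mean, pstdev


def lcb(q_values, kappa=1.0):
    """Lower confidence bound: ensemble mean minus kappa times ensemble std."""
    return mean(q_values) - kappa * pstdev(q_values)


def argmax_gate(q_expert, q_policy):
    """Hard gate on the ensemble mean. Returns 0 for expert, 1 for policy."""
    return 0 if mean(q_expert) >= mean(q_policy) else 1


def softmax_lcb_gate(q_expert, q_policy, kappa=1.0, temperature=1.0):
    """Soft gate: probabilities of picking (expert, policy) from LCB scores."""
    scores = [lcb(q_expert, kappa), lcb(q_policy, kappa)]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((s - top) / temperature) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def sample_source(probs, rng):
    """Draw which action source to execute under the soft gate."""
    return "expert" if rng.random() < probs[0] else "policy"


# Hypothetical ensemble estimates: the critics agree on the expert action
# but disagree strongly on the policy action, whose mean is slightly higher.
q_expert = [1.0, 1.1, 0.9, 1.0]
q_policy = [1.6, 0.2, 2.0, 0.6]

hard_choice = argmax_gate(q_expert, q_policy)
probs = softmax_lcb_gate(q_expert, q_policy)
source = sample_source(probs, random.Random(0))
```

With these made-up numbers the hard gate selects the policy action because its mean estimate is higher, while the LCB-based soft gate puts most of its probability on the expert action because the ensemble disagrees about the policy action. Whether this matches the mechanism behind the critic blind spot (F1) is something only the paper itself can confirm.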

