Delightful Exploration

arXiv:2605.1328759.4

AI Analysis

For practitioners facing action spaces too large to resolve within budget, Delight-gated exploration (DE) offers a practical heuristic that bounds disruption while showing weaker regret growth than Thompson Sampling and ε-greedy in the tested unresolved regimes.

Delight-gated exploration (DE) is a host-override rule that spends exploratory actions only when prospective delight (expected improvement times surprisal) exceeds a gate price, which recovers Pandora's reservation-value rule for costly search. Across Bernoulli bandits, linear bandits, and tabular MDPs, DE shows much weaker regret growth than Thompson Sampling and ε-greedy in the tested unresolved regimes, and the same hyperparameters transfer without retuning.
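The gate described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `delight_gated_action`, its parameters, and the choice to override with the single highest-delight arm are all assumptions, and the sketch takes the expected-improvement and surprisal estimates as given inputs rather than defining how they are computed.

```python
def delight_gated_action(host_action, expected_improvement, surprisal, gate_price):
    """Return the host's action unless some arm's prospective delight clears the gate.

    Hypothetical interface: expected_improvement[i] and surprisal[i] are
    caller-supplied estimates for arm i; how to estimate them is not covered here.
    """
    # Prospective delight per arm: expected improvement times surprisal.
    delight = [ei * s for ei, s in zip(expected_improvement, surprisal)]

    # Assumption: if several arms clear the gate, override with the largest delight.
    best = max(range(len(delight)), key=delight.__getitem__)

    if delight[best] > gate_price:
        return best          # gate open: spend an exploratory override
    return host_action       # gate closed: defer to the host policy


# Arm 1 has delight 0.2 * 2.0 = 0.4, which clears a gate price of 0.3 but not 0.5.
chosen_low_gate = delight_gated_action(0, [0.0, 0.2, 0.05], [0.1, 2.0, 3.0], 0.3)
chosen_high_gate = delight_gated_action(0, [0.0, 0.2, 0.05], [0.1, 2.0, 3.0], 0.5)
```

Under these assumptions, raising the gate price makes overrides rarer, which is one way to read the claim that the rule bounds disruption to the host.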

Most exploration algorithms search broadly until uncertainty is resolved. When the action space is too large to resolve within budget, practitioners default to ε-greedy, which bounds disruption but spends its override blindly. We introduce Delight-gated exploration (DE), a host-override rule that spends exploratory actions only when their prospective delight (expected improvement times surprisal) exceeds a gate price. This practical heuristic recovers a classical result: Pandora's reservation-value rule for costly search, with surprisal setting the effective inspection cost. Resolved arms exit the gate, fresh arms shut off above a prior-determined threshold, and selected linear-bandit overrides consume finite information budget. Across Bernoulli bandits, linear bandits, and tabular MDPs, the same hyperparameters transfer without retuning, and DE shows much weaker regret growth than Thompson Sampling and ε-greedy in the tested unresolved regimes. Delight improves acting for the same reason it improves learning: it prices scarce resources by the product of upside and surprisal.
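For readers unfamiliar with the classical result mentioned above: in Pandora's problem, a box with random reward X and inspection cost c has a reservation value z that solves E[max(X - z, 0)] = c. The sketch below computes z by bisection for a discrete reward distribution. The function names and the discrete-distribution setup are illustrative assumptions; the abstract says surprisal sets the effective inspection cost, but that mapping is not given here, so `cost` is left as a plain parameter.

```python
def expected_excess(values, probs, z):
    """E[max(X - z, 0)] for a discrete reward X."""
    return sum(p * max(v - z, 0.0) for v, p in zip(values, probs))


def reservation_value(values, probs, cost, iters=60):
    """Solve E[max(X - z, 0)] = cost for z by bisection (requires cost > 0).

    The expected excess is continuous and non-increasing in z. At
    z = min(values) - cost it is at least cost, and at z = max(values) it is 0,
    so the root lies in that bracket.
    """
    if cost <= 0:
        raise ValueError("cost must be positive")
    lo, hi = min(values) - cost, max(values)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if expected_excess(values, probs, mid) > cost:
            lo = mid   # excess still above cost: the root is to the right
        else:
            hi = mid   # excess at or below cost: the root is to the left
    return 0.5 * (lo + hi)


# A fair 0/1 reward with cost 0.1: 0.5 * (1 - z) = 0.1 gives z = 0.8.
z = reservation_value([0.0, 1.0], [0.5, 0.5], 0.1)
```

Pandora's rule then inspects boxes in decreasing order of reservation value and stops once the best reward seen so far exceeds every remaining reservation value. A higher inspection cost lowers z, so costlier boxes are opened later or not at all.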
