BALLAST: Bandit-Assisted Learning for Latency-Aware Stable Timeouts in Raft
This work addresses liveness issues in Raft-style distributed consensus and offers a practical improvement for real-world deployments with variable network conditions. The contribution appears incremental, however, because it layers bandit-based adaptation on top of existing timeout heuristics.
The paper addresses the brittleness of Raft's randomized election timeouts under long-tail latency and network turbulence. It presents BALLAST, a lightweight online adaptation mechanism based on contextual bandits, which substantially reduces recovery time and unwritable time in challenging WAN regimes.
Randomized election timeouts are a simple and effective liveness heuristic for Raft, but they become brittle under long-tail latency, jitter, and partition recovery, where repeated split votes can inflate unavailability. This paper presents BALLAST, a lightweight online adaptation mechanism that replaces static timeout heuristics with contextual bandits. BALLAST selects from a discrete set of timeout "arms" using efficient linear contextual bandits (LinUCB variants), and augments learning with safe exploration to cap risk during unstable periods. We evaluate BALLAST on a reproducible discrete-event simulation with long-tail delay, loss, correlated bursts, node heterogeneity, and partition/recovery turbulence. Across challenging WAN regimes, BALLAST substantially reduces recovery time and unwritable time compared to standard randomized timeouts and common heuristics, while remaining competitive on stable LAN/WAN settings.
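To make the mechanism in the abstract concrete, the following is a minimal sketch of disjoint LinUCB choosing among discrete election-timeout arms. It is an illustration under stated assumptions, not the paper's implementation: the class name `LinUCBTimeoutSelector`, the two-element context vector, the reward definition, the `unstable` flag, and the `safe_min_ms` floor are all hypothetical. In particular, "safe exploration" is modeled here as dropping the exploration bonus and excluding short timeouts while the node considers itself unstable; the paper's actual rule may differ.

```python
import math


def _matvec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def _dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


class LinUCBTimeoutSelector:
    """Disjoint LinUCB over discrete timeout arms (illustrative sketch)."""

    def __init__(self, timeouts_ms, dim, alpha=1.0, safe_min_ms=300):
        self.timeouts_ms = list(timeouts_ms)
        self.alpha = alpha              # exploration strength
        self.safe_min_ms = safe_min_ms  # assumed conservative floor
        # Per arm: inverse of A = I + sum(x x^T), and b = sum(reward * x).
        self.a_inv = [
            [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
            for _ in self.timeouts_ms
        ]
        self.b = [[0.0] * dim for _ in self.timeouts_ms]

    def select(self, x, unstable=False):
        """Return the index of the arm with the highest upper confidence bound."""
        arms = range(len(self.timeouts_ms))
        alpha = self.alpha
        if unstable:
            # Assumed safe-exploration rule: no bonus, no aggressive timeouts.
            alpha = 0.0
            safe = [i for i in arms if self.timeouts_ms[i] >= self.safe_min_ms]
            # Fall back to the longest timeout if no arm meets the floor.
            arms = safe or [max(arms, key=lambda i: self.timeouts_ms[i])]

        def ucb(i):
            theta = _matvec(self.a_inv[i], self.b[i])
            width = math.sqrt(max(_dot(x, _matvec(self.a_inv[i], x)), 0.0))
            return _dot(theta, x) + alpha * width

        return max(arms, key=ucb)

    def update(self, arm, x, reward):
        """Rank-one (Sherman-Morrison) update of A^-1, plus b += reward * x."""
        u = _matvec(self.a_inv[arm], x)
        denom = 1.0 + _dot(x, u)
        n = len(x)
        for i in range(n):
            for j in range(n):
                self.a_inv[arm][i][j] -= u[i] * u[j] / denom
        for i in range(n):
            self.b[arm][i] += reward * x[i]


# Usage with a hypothetical context [bias, normalized RTT] and a reward in
# [0, 1], where higher could mean faster recovery after a leader failure.
selector = LinUCBTimeoutSelector([150, 300, 600], dim=2, alpha=0.1)
context = [1.0, 0.5]
for _ in range(50):
    selector.update(0, context, 0.0)
    selector.update(1, context, 1.0)
    selector.update(2, context, 0.0)
chosen = selector.select(context)
print(selector.timeouts_ms[chosen])
```

Keeping a separate linear model per arm keeps each decision cheap for small context vectors, and the Sherman-Morrison update avoids a full matrix inversion on every observation. Restricting the candidate set during unstable periods is one simple way to bound the risk of an exploratory choice; it is an assumption of this sketch rather than a description of BALLAST's safeguard.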