When to Think, When to Speak: Learning Disclosure Policies for LLM Reasoning
For LLM practitioners, this work addresses the problem of premature commitment in autoregressive generation by making disclosure timing controllable, offering a practical method to balance reasoning quality and response latency.
The paper introduces Side-by-Side (SxS) Interleaved Reasoning, which decouples disclosure timing from reasoning in LLMs to improve accuracy-latency trade-offs. On Qwen3-30B-A3B and Qwen3-4B, SxS improves the accuracy versus content-latency Pareto frontier on both the AIME25 and GPQA-Diamond benchmarks.
In single-stream autoregressive interfaces, the same tokens both update the model state and constitute an irreversible public commitment. This coupling creates a \emph{silence tax}: additional deliberation postpones the first \emph{task-relevant} content, while naive early streaming risks premature commitments that bias subsequent generations. We introduce \textbf{\emph{Side-by-Side (SxS)}} Interleaved Reasoning, which makes \emph{disclosure timing} a controllable decision within standard autoregressive generation. SxS interleaves partial disclosures with continued private reasoning in the same context, but releases content only when it is \emph{supported} by the reasoning so far. To learn such pacing without incentivizing filler, we construct entailment-aligned interleaved trajectories by matching answer prefixes to supporting reasoning prefixes, then train with SFT to acquire the dual-action semantics and RL to recover reasoning performance under the new format. Across two Qwen3 architectures/scales (MoE \textbf{Qwen3-30B-A3B}, dense \textbf{Qwen3-4B}) and both in-domain (AIME25) and out-of-domain (GPQA-Diamond) benchmarks, SxS improves accuracy--\emph{content-latency} Pareto trade-offs under token-level proxies (e.g., inter-update waiting).
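To make the interleaving idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the alignment step has already produced, for each answer chunk, the length of the shortest reasoning prefix that supports it (the `support` list). The names `Segment`, `interleave`, and `inter_update_waits` are hypothetical, and the wait count is only one possible token-level reading of "inter-update waiting".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    kind: str          # "think" = private reasoning, "speak" = disclosed content
    tokens: List[str]


def interleave(reasoning: List[str],
               answer_chunks: List[List[str]],
               support: List[int]) -> List[Segment]:
    """Build an interleaved trajectory (illustrative only).

    support[i] is ASSUMED to be the number of reasoning tokens that must
    precede answer chunk i for that chunk to be supported. How this number
    is obtained (e.g. by an entailment check) is outside this sketch.
    """
    if len(answer_chunks) != len(support):
        raise ValueError("need one support index per answer chunk")
    segments: List[Segment] = []
    cursor = 0  # number of reasoning tokens emitted so far
    for chunk, need in zip(answer_chunks, support):
        # Never move backwards, never run past the end of the reasoning.
        need = min(max(need, cursor), len(reasoning))
        if need > cursor:
            segments.append(Segment("think", reasoning[cursor:need]))
            cursor = need
        # Disclose the chunk only after its supporting prefix is in context.
        segments.append(Segment("speak", list(chunk)))
    if cursor < len(reasoning):
        # Reasoning left over after the last disclosure stays private.
        segments.append(Segment("think", reasoning[cursor:]))
    return segments


def inter_update_waits(segments: List[Segment]) -> List[int]:
    """Count private tokens generated before each disclosed segment."""
    waits: List[int] = []
    pending = 0
    for seg in segments:
        if seg.kind == "think":
            pending += len(seg.tokens)
        else:
            waits.append(pending)
            pending = 0
    return waits


reasoning = ["r1", "r2", "r3", "r4", "r5", "r6"]
sxs = interleave(reasoning, [["a1"], ["a2", "a3"]], support=[2, 5])
think_then_answer = interleave(reasoning, [["a1", "a2", "a3"]], support=[6])
print(inter_update_waits(sxs))                # [2, 3]
print(inter_update_waits(think_then_answer))  # [6]
```

In this toy case the interleaved trajectory surfaces its first content after 2 private tokens instead of 6, which is the kind of reduction in the silence tax that the abstract describes. Whether each disclosed chunk is truly supported depends entirely on the quality of the assumed `support` alignment.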