LG, May 22

State commitment learning: training language models to distinguish computation from memory

Fei Ding, Yongkang Zhang, Runhao Liu, Yuhao Liao, Zijian Zeng, Huiming Yang

arXiv:2606.05201

AI Analysis

For developers of reasoning language models, this work addresses downstream reasoning's unreliable dependence on temporary computation, offering a training method that keeps answers correct when that computation is erased.

Language models currently treat all generated tokens as persistent state, causing downstream reasoning to depend on temporary scratch work. The authors propose state commitment learning via Counterfactual Erasure RL (CERL), which trains models to distinguish computation from memory, reducing answer dependence on hidden thoughts without sacrificing accuracy across math, logic, scientific QA, and tool-use tasks.

Reasoning language models do not distinguish tokens used for computation from tokens that constitute persistent state: once generated, all hidden thoughts remain in context and influence future predictions. As a result, downstream reasoning may depend on failed attempts, dead ends, and private scratch work that cannot be safely relied on later. We recast this phenomenon as a new training objective, state commitment learning: training models to explicitly distinguish information that should be committed as persistent state from temporary computation that can be discarded. We define a counterfactual criterion, persistent-state sufficiency, which asks whether an answer remains usable after hidden thoughts are erased and which is both trainable and measurable. We then propose Counterfactual Erasure RL (CERL), which evaluates, under the same prefix, both a path that keeps hidden thoughts and a path that erases them, and gives reward only when the erasure path remains correct. We also introduce the Erasure Dependence Protocol and show, across mathematics, long-chain logic, scientific QA, and multi-turn tool-use evaluation, that CERL substantially reduces answer dependence on hidden thoughts without sacrificing accuracy, consistently outperforming correctness-only RL and long-answer SFT baselines.
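The reward rule described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Rollout`, `cerl_reward`, and `toy_answer` are hypothetical, the "model" is a toy string function, and the abstract does not say whether the keep path must also be correct, so the sketch rewards on erasure-path correctness alone and reports keep-path correctness only as a diagnostic.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rollout:
    """One sampled continuation (hypothetical structure, for illustration)."""
    hidden_thoughts: str   # temporary computation that may be erased
    committed_state: str   # text the model chose to keep as persistent state


@dataclass
class ErasureResult:
    keep_correct: bool     # diagnostic only in this sketch
    erase_correct: bool
    reward: float


def cerl_reward(
    prefix: str,
    rollout: Rollout,
    answer_fn: Callable[[str], str],
    is_correct: Callable[[str], bool],
) -> ErasureResult:
    """Evaluate the keep path and the erasure path under the same prefix.

    Assumption: reward is 1.0 exactly when the erasure-path answer is correct.
    """
    keep_context = prefix + rollout.hidden_thoughts + rollout.committed_state
    erase_context = prefix + rollout.committed_state  # hidden thoughts removed
    keep_ok = is_correct(answer_fn(keep_context))
    erase_ok = is_correct(answer_fn(erase_context))
    return ErasureResult(keep_ok, erase_ok, 1.0 if erase_ok else 0.0)


def toy_answer(context: str) -> str:
    """Toy stand-in for a model: return the token after the last 'ans='."""
    marker = "ans="
    idx = context.rfind(marker)
    if idx == -1:
        return ""
    tokens = context[idx + len(marker):].split()
    return tokens[0] if tokens else ""


def is_seven(answer: str) -> bool:
    return answer == "7"


prefix = "Q: 3+4? "
# The result lives only in the scratch work, so erasing it breaks the answer.
dependent = Rollout(hidden_thoughts=" scratch: 3+4 ans=7 ", committed_state=" done ")
# The result is committed to persistent state, so erasure is harmless.
committed = Rollout(hidden_thoughts=" scratch: 3+4 ", committed_state=" ans=7 ")

dep_result = cerl_reward(prefix, dependent, toy_answer, is_seven)
com_result = cerl_reward(prefix, committed, toy_answer, is_seven)
```

In this toy setup the first rollout is correct only while its scratch work stays in context, so it earns no reward, while the second earns reward because its answer survives erasure. A correctness-only reward would not separate the two.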
