CL · Mar 2

PonderLM-3: Adaptive Token-Wise Pondering with Differentiable Masking

He Li, Feichen Song, Boyi Zeng, Shixiang Song, Zhiqin John Xu, Ziwei He, Zhouhan Lin

arXiv:2603.02023v11.11 citations · h-index: 1

Originality: Incremental advance

AI Analysis

This work addresses how to allocate inference compute in language models. Its contribution is a domain-specific, incremental improvement over prior methods such as PonderLM-2.

The paper addresses how to allocate additional inference-time computation efficiently in language models. It introduces PonderLM-3, a pretraining framework for token-wise adaptive pondering that learns to allocate computation selectively under self-supervised objectives. Compared with baselines, it achieves lower pretraining perplexity at equal inference FLOPs and comparable downstream performance with fewer inference FLOPs.

Test-time scaling has shown that allocating additional computation at inference can improve generation quality, motivating a natural follow-up question: where should this computation be spent? Building on this insight, we introduce PonderLM-3, a pretraining framework built on the PonderLM-2 backbone for token-wise adaptive pondering, which learns to selectively allocate additional computation under purely self-supervised objectives. This makes additional inference computation an allocatable per-token resource: tokens receive more computation only when it is beneficial, rather than all paying a uniform extra cost. To make this allocation learnable while maintaining train-inference consistency, PonderLM-3 injects a differentiable attention mask during pretraining and pairs it with a matching hard pruning rule at inference. PonderLM-3 defines a stronger Pareto frontier: compared with existing recursive or adaptive baselines, it achieves lower pretraining perplexity at equal inference FLOPs. On downstream benchmarks, PonderLM-3 attains performance comparable to fixed-step PonderLM-2 under the same maximum number of additional computation steps, while using fewer inference FLOPs in practice. Overall, PonderLM-3 provides an end-to-end differentiable and train-inference consistent framework for token-wise adaptive computation.
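The pairing of a differentiable attention mask in training with a hard pruning rule at inference can be illustrated with a small sketch. The abstract does not specify the exact mechanism, so everything below is an assumption made for illustration: a per-slot gate in (0, 1), an additive attention bias of log(gate) during training, and a 0.5 threshold at inference. The function names (`soft_mask_bias`, `hard_mask_bias`, `attention_weights`) are hypothetical and are not taken from the paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    # Numerically stable softmax; exp(-inf) evaluates to 0.0 in Python.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_mask_bias(gate, eps=1e-9):
    # ASSUMPTION: training-time soft mask as an additive bias log(gate).
    # The bias is ~0 when gate is near 1 and strongly negative near 0,
    # and it is differentiable with respect to the gate.
    return math.log(gate + eps)


def hard_mask_bias(gate, threshold=0.5):
    # ASSUMPTION: inference-time rule. A slot is either fully kept
    # (bias 0) or pruned (bias -inf, so its computation can be skipped).
    return 0.0 if gate > threshold else float("-inf")


def attention_weights(logits, biases):
    return softmax([l + b for l, b in zip(logits, biases)])


# Three attention slots: one base state (always kept, gate = 1) and two
# extra pondering states with learned gates (values chosen for illustration).
logits = [0.2, 0.5, 0.1]
gates = [1.0, sigmoid(4.0), sigmoid(-4.0)]

train_w = attention_weights(logits, [soft_mask_bias(g) for g in gates])
infer_w = attention_weights(logits, [hard_mask_bias(g) for g in gates])

# Number of extra pondering slots actually computed at inference.
kept_extra = sum(1 for g in gates[1:] if g > 0.5)

print("train:", [round(w, 4) for w in train_w])
print("infer:", [round(w, 4) for w in infer_w])
print("extra slots kept at inference:", kept_extra)
```

In this sketch the soft and hard masks yield nearly identical attention weights once the gates saturate toward 0 or 1, which is one way to read "train-inference consistency". The pruned slot receives exactly zero weight at inference, so its extra computation would not need to be performed.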

