CL · LG · Feb 22

IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning

arXiv:2602.19049v1 · 2 citations · h-index: 7 · Has Code
Originality: Highly original
AI Analysis

This work addresses token efficiency for users of large language models. It offers a novel method for controlling how reasoning effort is allocated across tokens, though it builds on existing reward-shaping approaches.

The paper tackles the high inference-time cost of long chain-of-thought reasoning in large language models. It proposes IAPO, an information-theoretic post-training framework that reduces reasoning length by up to 36% while improving accuracy across various reasoning datasets.

Large language models increasingly rely on long chains of thought to improve accuracy, yet such gains come with substantial inference-time costs. We revisit token-efficient post-training and argue that existing sequence-level reward-shaping methods offer limited control over how reasoning effort is allocated across tokens. To bridge this gap, we propose IAPO, an information-theoretic post-training framework that assigns token-wise advantages based on each token's conditional mutual information (MI) with the final answer. This yields an explicit, principled mechanism for identifying informative reasoning steps and suppressing low-utility exploration. We provide a theoretical analysis showing that IAPO can induce monotonic reductions in reasoning verbosity without harming correctness. Empirically, IAPO consistently improves reasoning accuracy while reducing reasoning length by up to 36%, outperforming existing token-efficient RL methods across various reasoning datasets. Extensive empirical evaluations demonstrate that information-aware advantage shaping is a powerful and general direction for token-efficient post-training. The code is available at https://github.com/YinhanHe123/IAPO.
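
As a rough illustration of what "token-wise advantages based on information about the final answer" could look like, the sketch below shifts a sequence-level advantage up or down per token. It is not the paper's estimator or objective. The pointwise gain in the answer's log-probability is used here only as an assumed stand-in for conditional MI, and the names `information_gains`, `token_advantages`, and `beta` are hypothetical.

```python
from typing import List


def information_gains(answer_logprobs: List[float]) -> List[float]:
    """Per-token gain in the log-probability of the correct answer.

    answer_logprobs[t] is assumed to be log p(answer | prompt, first t
    reasoning tokens), for t = 0..T. The difference between consecutive
    entries is a pointwise proxy for how much token t tells us about the
    answer (an assumption, not the paper's MI estimator).
    """
    return [after - before
            for before, after in zip(answer_logprobs, answer_logprobs[1:])]


def token_advantages(seq_advantage: float,
                     gains: List[float],
                     beta: float = 0.5) -> List[float]:
    """Spread a sequence-level advantage over tokens.

    Each token receives the sequence advantage plus beta times its
    mean-centered gain, so informative tokens are rewarded more and
    uninformative ones less, while the average stays at seq_advantage.
    """
    if not gains:
        return []
    mean_gain = sum(gains) / len(gains)
    return [seq_advantage + beta * (g - mean_gain) for g in gains]


# Toy rollout with three reasoning tokens; the middle one adds nothing.
gains = information_gains([-3.0, -2.0, -2.0, -0.5])
advantages = token_advantages(seq_advantage=1.0, gains=gains)
print(gains)                              # [1.0, 0.0, 1.5]
print([round(a, 3) for a in advantages])  # [1.083, 0.583, 1.333]
```

In this toy case the uninformative middle token ends up with the smallest advantage, which is the qualitative behaviour the abstract describes as suppressing low-utility exploration. Centering the gains is a design choice of this sketch that keeps the mean advantage equal to the sequence-level value.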
