LG · AI · Feb 25

UpSkill: Mutual Information Skill Learning for Structured Response Diversity in LLMs

arXiv:2602.22296v1 · 1 citation · h-index: 3
Originality: Incremental advance
AI Analysis

The paper addresses narrow exploration in LLMs on mathematics and programming tasks. The advance is incremental: it adapts an existing method, Mutual Information Skill Learning (MISL), to LLMs to improve response diversity.

The paper tackles suppressed response diversity when LLMs are trained with reinforcement learning on reasoning tasks. It introduces UpSkill, which optimizes pass@k correctness, and reports mean pass@k gains of ~3% on GSM8K for Qwen 2.5-7B and Llama 3.1-8B without degrading pass@1.
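To make the pass@1 versus pass@k distinction concrete, here is a minimal sketch using the standard unbiased pass@k estimator (1 - C(n-c, k) / C(n, k) for n samples of which c are correct). The source does not say which estimator the paper uses, so treat this as an assumption; the function name `pass_at_k` and the two toy "models" are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n sampled attempts, c of them correct.

    Assumption: this is the widely used combinatorial estimator, not
    necessarily the exact evaluation protocol of the paper.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical toy data: correct counts out of n = 8 samples on two problems.
n = 8
narrow_model = [8, 0]   # always solves problem 1, never solves problem 2
diverse_model = [4, 4]  # solves each problem half of the time

for name, counts in [("narrow", narrow_model), ("diverse", diverse_model)]:
    p1 = sum(pass_at_k(n, c, 1) for c in counts) / len(counts)
    p4 = sum(pass_at_k(n, c, 4) for c in counts) / len(counts)
    print(f"{name}: pass@1={p1:.3f} pass@4={p4:.3f}")
```

Both toy models have the same pass@1, but only the diverse one benefits from repeated attempts. That gap is what a pass@k-oriented objective is meant to widen.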

Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of large language models (LLMs) on mathematics and programming tasks, but standard approaches that optimize single-attempt accuracy can inadvertently suppress response diversity across repeated attempts, narrowing exploration and overlooking underrepresented strategies. We introduce UpSkill, a training-time method that adapts Mutual Information Skill Learning (MISL) to LLMs for optimizing pass@k correctness. We propose a novel reward that we implement within Group Relative Policy Optimization (GRPO): a token-level mutual information (MI) reward that encourages trajectory specificity to a latent skill variable z. Experiments on GSM8K with three open-weight models, Llama 3.1-8B, Qwen 2.5-7B, and R1-Distilled-Qwen2.5-Math-1.5B, show that UpSkill improves multi-attempt metrics on the stronger base models, yielding mean gains of ~3% in pass@k for both Qwen and Llama without degrading pass@1. Additionally, we find both empirical and theoretical evidence that improvements in pass@k are closely tied to the mutual information objective.
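The abstract does not spell out the reward, so the following is only an illustrative sketch of how a token-level MI bonus could be combined with GRPO-style group normalization. It assumes a discrete skill z with a uniform prior, and scores each token by log p(token | z) minus the log of the skill-averaged probability. The function names, the weight `alpha`, and the use of a population standard deviation are hypothetical choices, not the paper's implementation.

```python
import math
from statistics import mean, pstdev


def token_mi_rewards(logp_by_skill, z):
    """Per-token MI-style reward for a response generated under skill z.

    logp_by_skill[s][t] is the log-probability of the response's t-th token
    when the policy is conditioned on skill s (assumed uniform skill prior).
    Reward: log p(token | z) - log mean_s p(token | s).
    """
    num_skills = len(logp_by_skill)
    rewards = []
    for t in range(len(logp_by_skill[z])):
        column = [logp_by_skill[s][t] for s in range(num_skills)]
        peak = max(column)  # log-sum-exp shift for numerical stability
        log_marginal = (
            peak
            + math.log(sum(math.exp(v - peak) for v in column))
            - math.log(num_skills)
        )
        rewards.append(logp_by_skill[z][t] - log_marginal)
    return rewards


def shaped_reward(correct, mi_tokens, alpha=0.1):
    """Verifiable correctness reward plus a weighted mean token MI bonus."""
    bonus = mean(mi_tokens) if mi_tokens else 0.0
    return correct + alpha * bonus


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical two-skill, two-token example for a response sampled with z = 0.
logps = [
    [math.log(0.9), math.log(0.5)],  # token log-probs under skill 0
    [math.log(0.1), math.log(0.5)],  # token log-probs under skill 1
]
mi = token_mi_rewards(logps, z=0)
group = [shaped_reward(1.0, mi), shaped_reward(0.0, mi), 1.0, 0.0]
print(group_relative_advantages(group))
```

Under this sketch a token earns a positive bonus only when it is more likely under the sampled skill than under the average skill, which is one way to read "trajectory specificity to z". Tokens that every skill predicts equally well contribute nothing.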

