AI LG Sep 29, 2025

Risk-Sensitive RL for Alleviating Exploration Dilemmas in Large Language Models

arXiv:2509.24261v1
8 citations
h-index: 7
Originality: Incremental advance
AI Analysis

The work addresses a bottleneck in improving LLMs on complex reasoning tasks, offering a simple method that boosts multi-solution performance (pass@k) without sacrificing single-solution accuracy (pass@1).

The paper tackled the exploration dilemma in reinforcement learning for large language models, where standard methods limit solution diversity, and introduced a risk-sensitive framework that improved pass@k performance on mathematical reasoning benchmarks while maintaining pass@1 accuracy.

Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for enhancing Large Language Models (LLMs) on complex reasoning tasks. However, existing methods suffer from an exploration dilemma: the sharply peaked initial policies of pre-trained LLMs confine standard RL algorithms to a narrow set of solutions, boosting single-solution accuracy (pass@1) but suppressing solution diversity and multi-solution performance (pass@k). As a result, RLVR often distills existing capabilities rather than discovering new reasoning strategies. To overcome this, we introduce a Risk-Sensitive Reinforcement Learning framework. Our approach employs a risk-seeking objective that interpolates between mean and maximum rewards, leading to a novel algorithm, Risk-Sensitive GRPO (RS-GRPO), which drives deeper exploration by amplifying learning from challenging prompts. Remarkably, RS-GRPO is simple to implement, requiring only minor code modifications. On six mathematical reasoning benchmarks and with five different LLMs, RS-GRPO consistently improves pass@k performance while maintaining or enhancing pass@1 accuracy.
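To make "interpolates between mean and maximum rewards" concrete, here is a minimal sketch of one standard risk-seeking objective with that property, the entropic (exponential-utility) risk measure (1/beta) * log E[exp(beta * r)]. The abstract does not state the exact objective, so this form, the function name `entropic_risk`, and the parameter `beta` are assumptions for illustration and not necessarily the RS-GRPO formulation.

```python
import math

def entropic_risk(rewards, beta):
    """Illustrative risk-sensitive value: (1/beta) * log(mean(exp(beta * r))).

    Assumed form, not taken from the paper. As beta -> 0 it approaches the
    mean reward; as beta -> +infinity it approaches the maximum reward.
    """
    if beta == 0:
        # Limiting case: the ordinary (risk-neutral) mean.
        return sum(rewards) / len(rewards)
    # Subtract the largest exponent before exponentiating for numerical stability.
    m = max(beta * r for r in rewards)
    mean_exp = sum(math.exp(beta * r - m) for r in rewards) / len(rewards)
    return (m + math.log(mean_exp)) / beta

# Binary verifiable rewards for four sampled answers to a hard prompt:
# only one of the four is correct.
rewards = [0.0, 0.0, 0.0, 1.0]
for beta in (0.0, 1.0, 5.0, 1000.0):
    print(beta, round(entropic_risk(rewards, beta), 4))
```

With a small `beta` the value stays near the mean reward of 0.25, while a large `beta` pushes it toward the best sample's reward of 1. A prompt that is rarely solved therefore still contributes a strong signal, which matches the abstract's description of amplifying learning from challenging prompts.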

