CL · LG · Jun 5, 2025

Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models

arXiv:2506.06395v3 · 48 citations · h-index: 8
Originality: Incremental advance
AI Analysis

RLSC provides a simple, scalable post-training method for inference models that requires only a small number of samples and unlabelled supervision. The advance is incremental because it builds on existing RL fine-tuning approaches.

The paper tackles the dependence of reinforcement learning fine-tuning on costly human annotations or external reward models. It proposes RLSC, which uses the model's own confidence as the reward signal. With only 16 samples per question and 10 to 20 training steps, RLSC improves accuracy by up to +21.7% across several math benchmarks.

Large language models (LLMs) excel at reasoning, yet post-training remains critical for aligning their behavior with task goals. Existing reinforcement learning (RL) methods often depend on costly human annotations or external reward models. We propose Reinforcement Learning via Self-Confidence (RLSC), which uses the model's own confidence as reward signals, eliminating the need for labels, preference models, or reward engineering. Applied to Qwen2.5-Math-7B with only 16 samples per question and 10 or 20 training steps, RLSC improves accuracy by +13.4% on AIME2024, +21.2% on MATH500, +21.7% on Minerva Math, +20.8% on OlympiadBench, and +9.7% on AMC23. RLSC provides a simple, scalable post-training method for inference models, requiring only a small number of samples and unlabelled supervision.
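
The abstract does not spell out the training objective, so the following is only a toy sketch of one plausible reading of "the model's own confidence as reward": each sampled answer is rewarded with the probability the policy itself assigned to it, and that reward is held constant while the log-probability of the sample is pushed up. A small categorical distribution over candidate answers stands in for the language model. All names (`softmax`, `self_confidence`, `rlsc_step`, `train`) and the learning rate are hypothetical and are not taken from the paper or its code; the defaults of 16 samples and 20 steps merely echo the numbers quoted above.

```python
import math
import random


def softmax(logits):
    """Convert logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def self_confidence(probs):
    """Expected probability the policy assigns to its own sample: sum_y p(y)^2."""
    return sum(p * p for p in probs)


def rlsc_step(logits, num_samples=16, lr=1.0, rng=random):
    """One label-free update on a toy categorical policy (illustrative only).

    Each sampled answer y is weighted by the policy's own probability p(y),
    treated as a constant reward, and the logits move along
    weight * d log p(y) / d logits = weight * (onehot(y) - p).
    """
    probs = softmax(logits)
    samples = rng.choices(range(len(logits)), weights=probs, k=num_samples)
    grad = [0.0] * len(logits)
    for y in samples:
        weight = probs[y]  # self-confidence used as the reward (assumption)
        for k in range(len(logits)):
            indicator = 1.0 if k == y else 0.0
            grad[k] += weight * (indicator - probs[k]) / num_samples
    return [z + lr * g for z, g in zip(logits, grad)]


def train(logits, steps=20, seed=0):
    """Run a handful of updates with a seeded generator for reproducibility."""
    rng = random.Random(seed)
    for _ in range(steps):
        logits = rlsc_step(logits, rng=rng)
    return logits
```

Calling `train([1.0, 0.5, 0.0, 0.0])` and comparing `self_confidence(softmax(...))` before and after shows how the distribution changes over a few steps. In expectation the update sharpens the distribution around answers the policy already favours, which is why no labels or external reward model appear anywhere in the loop. Whether that sharpening improves accuracy depends on the base model already preferring correct answers, which this toy does not model.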

