CL | Aug 17, 2025

ReaLM: Reflection-Enhanced Autonomous Reasoning with Small Language Models

Yuanfeng Xu, Zehui Dai, Jian Liang, Jiapeng Guan, Guangrun Wang, Liang Lin, Xiaohui Lv

arXiv:2508.12387v1 | 2.7 | h-index: 4

Originality: Incremental advance

AI Analysis

This work addresses the challenge of making SLMs more robust and self-sufficient for reasoning tasks, which is important for cost-effective AI applications, though it appears incremental as it builds on existing methods like reinforcement learning and chain-of-thought distillation.

The paper tackles the problem that small language models (SLMs) struggle with complex reasoning because of their limited capacity and inconsistent answers. It introduces ReaLM, a reinforcement learning framework designed to enhance reasoning capability, autonomy, and generalization, and reports significant performance improvements on vertical and general reasoning tasks.

Small Language Models (SLMs) are a cost-effective alternative to Large Language Models (LLMs), but often struggle with complex reasoning due to their limited capacity and a tendency to produce mistakes or inconsistent answers during multi-step reasoning. Existing efforts have improved SLM performance, but typically at the cost of one or more of three key aspects: (1) reasoning capability, due to biased supervision that filters out negative reasoning paths and limits learning from errors; (2) autonomy, due to over-reliance on externally generated reasoning signals; and (3) generalization, which suffers when models overfit to teacher-specific patterns. In this paper, we introduce ReaLM, a reinforcement learning framework for robust and self-sufficient reasoning in vertical domains. To enhance reasoning capability, we propose Multi-Route Process Verification (MRPV), which contrasts both positive and negative reasoning paths to extract decisive patterns. To reduce reliance on external guidance and improve autonomy, we introduce Enabling Autonomy via Asymptotic Induction (EAAI), a training strategy that gradually fades external signals. To improve generalization, we apply guided chain-of-thought distillation to encode domain-specific rules and expert knowledge into SLM parameters, making them part of what the model has learned. Extensive experiments on both vertical and general reasoning tasks demonstrate that ReaLM significantly improves SLM performance across aspects (1)-(3) above.
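The abstract says only that EAAI "gradually fades external signals" and does not give the schedule. As a purely illustrative sketch, and not the authors' method, one way to fade guidance is to attach an external hint to each training prompt with a probability that decays linearly over training. The names `guidance_probability` and `build_prompt`, the linear schedule, and the hint format are all assumptions made for this example.

```python
import random


def guidance_probability(step: int, total_steps: int) -> float:
    """Hypothetical linear schedule: 1.0 at step 0, 0.0 at total_steps."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so out-of-range steps stay well defined.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 1.0 - progress


def build_prompt(question: str, hint: str, step: int, total_steps: int,
                 rng: random.Random) -> str:
    """Attach the external hint with a probability that fades over training."""
    if rng.random() < guidance_probability(step, total_steps):
        return f"{question}\nHint: {hint}"
    # Late in training the model usually has to reason without the hint.
    return question


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 50, 100):
        print(step, guidance_probability(step, 100))
        print(build_prompt("What is 12 * 7?", "Split 12 into 10 + 2.", step, 100, rng))
```

Under this assumed schedule the hint is always present at the first step and never present at the last, so the model is gradually pushed toward unassisted reasoning. Other decay shapes (exponential, stepwise) would fit the abstract's description equally well.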
