CL, Jan 29

Can David Beat Goliath? On Multi-Hop Reasoning with Resource-Constrained Agents

arXiv:2601.21699v1
h-index: 6
Originality: Highly original
AI Analysis

This paper addresses the challenge of deploying efficient reasoning agents in resource-limited environments and offers a practical approach for scenarios where large models are infeasible.

The paper tackles the problem of enabling small language models to perform multi-hop reasoning under resource constraints, achieving strong performance on six benchmarks with agents up to 1.5B parameters trained on only four RTX 3090 GPUs.

While reinforcement learning (RL) has empowered multi-turn reasoning agents with retrieval and tools, existing successes largely depend on extensive on-policy rollouts in high-cost, high-accuracy regimes. Under realistic resource constraints that cannot support large models or dense explorations, however, small language model agents fall into a low-cost, low-accuracy regime, where limited rollout budgets lead to sparse exploration, sparse credit assignment, and unstable training. In this work, we challenge this trade-off and show that small language models can achieve strong multi-hop reasoning under resource constraints. We introduce DAVID-GRPO, a budget-efficient RL framework that (i) stabilizes early learning with minimal supervision, (ii) assigns retrieval credit based on evidence recall, and (iii) improves exploration by resampling truncated near-miss trajectories. Evaluated on agents up to 1.5B parameters trained on only four RTX 3090 GPUs, DAVID-GRPO consistently outperforms prior RL methods designed for large-scale settings on six multi-hop QA benchmarks. These results show that with the right inductive biases, small agents can achieve low training cost with high accuracy.
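As a rough illustration of point (ii), the sketch below shows one way an evidence-recall term could be folded into a GRPO-style group-relative advantage. It is not the paper's formulation: the abstract gives no formula, so the function names, the additive combination, and the `recall_weight` parameter are all hypothetical. The only standard ingredient is the group normalization (reward minus group mean, divided by group standard deviation) commonly used in GRPO.

```python
from statistics import mean, pstdev


def evidence_recall(retrieved_ids, gold_ids):
    """Fraction of gold evidence documents that the agent retrieved."""
    gold = set(gold_ids)
    if not gold:
        return 0.0
    return len(gold & set(retrieved_ids)) / len(gold)


def rollout_reward(answer_correct, retrieved_ids, gold_ids, recall_weight=0.5):
    """Outcome reward plus a recall bonus (additive mix is an assumption)."""
    outcome = 1.0 if answer_correct else 0.0
    return outcome + recall_weight * evidence_recall(retrieved_ids, gold_ids)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one rollout group, GRPO-style."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    gold = ["d1", "d2"]  # hypothetical gold evidence ids for one question
    group = [
        (True, ["d1", "d2"]),   # correct answer, full evidence
        (False, ["d1", "d9"]),  # wrong answer, half the evidence
        (False, ["d7", "d8"]),  # wrong answer, no evidence
    ]
    rewards = [rollout_reward(ok, docs, gold) for ok, docs in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Under these assumptions, a rollout that answers incorrectly but retrieves part of the gold evidence still earns a nonzero reward, so it is ranked above a rollout that retrieves nothing. That is the kind of denser signal that could matter when the rollout budget is small.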
