CL · Apr 1

TR-ICRL: Test-Time Rethinking for In-Context Reinforcement Learning

Wenxuan Jiang, Yuxin Zuo, Zijian Zhang, Xuecheng Wu, Zining Fan, Wenxuan Liu, Li Chen, Xiaoyu Li, Xuezhi Cao, Xiaolong Jin, Ninghao Liu

arXiv:2604.00438 · 96.8 · h-index: 9 · Has Code

Predicted impact: top 6% in CL · last 90 days · Originality: Incremental advance

AI Analysis

This paper addresses reward estimation in ICRL when ground-truth labels are unavailable at inference time. It is relevant to AI researchers and practitioners: the method is an incremental advance, but it reports strong gains on specific benchmarks.

The paper tackles the challenge of reward estimation in In-Context Reinforcement Learning (ICRL) by proposing TR-ICRL, a framework that uses pseudo-labels from retrieved instances to guide iterative refinement, resulting in significant performance gains such as a 21.23% average improvement on MedQA and 137.59% on AIME2024 for the Qwen2.5-7B model.

In-Context Reinforcement Learning (ICRL) enables Large Language Models (LLMs) to learn online from external rewards directly within the context window. However, a central challenge in ICRL is reward estimation, as models typically lack access to ground-truth labels during inference. To address this limitation, we propose Test-Time Rethinking for In-Context Reinforcement Learning (TR-ICRL), a novel ICRL framework designed for both reasoning and knowledge-intensive tasks. For a given query, TR-ICRL first retrieves the most relevant instances from an unlabeled evaluation set. During each ICRL iteration, the LLM generates a set of candidate answers for every retrieved instance, and a pseudo-label is derived from this set through majority voting. This label then serves as a proxy for the ground truth: it is used to produce reward messages and formative feedback that guide the LLM through iterative refinement. Finally, the synthesized contextual information is combined with the original query to form a comprehensive prompt, and the answer is determined through a final round of majority voting. TR-ICRL is evaluated on mainstream reasoning and knowledge-intensive tasks, where it demonstrates significant performance gains. Remarkably, TR-ICRL improves Qwen2.5-7B by 21.23% on average on MedQA and by 137.59% on AIME2024. Extensive ablation studies and analyses further validate the effectiveness and robustness of our approach. Our code is available at https://github.com/pangpang-xuan/TR_ICRL.
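The loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: retrieval is assumed to have happened already, `sample` is a hypothetical stand-in for an LLM call that returns `k` candidate answers for a prompt, and the binary reward and the feedback wording are assumptions chosen only to make the control flow concrete.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical LLM interface (an assumption): (prompt, k) -> k candidate answers.
Sampler = Callable[[str, int], List[str]]


def majority_vote(candidates: List[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    if not candidates:
        raise ValueError("need at least one candidate answer")
    return Counter(candidates).most_common(1)[0][0]


def reward_messages(candidates: List[str], pseudo_label: str) -> List[Tuple[str, int]]:
    """Assumed binary reward: 1 if a candidate agrees with the pseudo-label, else 0."""
    return [(answer, int(answer == pseudo_label)) for answer in candidates]


def tr_icrl_sketch(query: str, retrieved: List[str], sample: Sampler,
                   n_iters: int = 2, k: int = 5) -> str:
    """Illustrative loop: vote per retrieved instance, feed rewards back, then answer."""
    context: List[str] = []
    for instance in retrieved:
        feedback = ""
        for _ in range(n_iters):
            # Sample candidates for the instance plus feedback from the previous round.
            candidates = sample(instance + feedback, k)
            # The pseudo-label stands in for the unavailable ground truth.
            label = majority_vote(candidates)
            rewards = reward_messages(candidates, label)
            # Assumed feedback format; the paper's actual wording is not given here.
            feedback = "\n" + "\n".join(
                f"Answer {a!r}: reward {r}" for a, r in rewards
            )
        context.append(f"{instance}{feedback}")
    # Combine the accumulated context with the original query, then vote once more.
    prompt = "\n\n".join(context + [query])
    return majority_vote(sample(prompt, k))
```

In practice `sample` would wrap a temperature-sampled model call; any deterministic stub with the same signature is enough to exercise the control flow.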
