Mar 7, 2025

Rewarding Curse: Analyze and Mitigate Reward Modeling Issues for LLM Reasoning

arXiv:2503.05188v1
2 citations
h-index: 28
Originality: Incremental advance
AI Analysis

This paper addresses unreliable reasoning in large language models on tasks that require step-by-step logic. The contribution appears incremental, since it builds on existing chain-of-thought (CoT) methods.

The paper analyzes and improves chain-of-thought (CoT) prompting for LLM reasoning. It identifies factors that affect CoT effectiveness, such as problem difficulty and information flow. It also examines unfaithful CoT, in which the model, when predicting the answer, recalls correct information from the question that is missing from the CoT. Finally, it proposes an algorithm that enhances both faithfulness and effectiveness.

Chain-of-thought (CoT) prompting demonstrates varying performance under different reasoning tasks. Previous work attempts to evaluate it but falls short in providing an in-depth analysis of patterns that influence the CoT. In this paper, we study the CoT performance from the perspective of effectiveness and faithfulness. For the former, we identify key factors that influence CoT effectiveness on performance improvement, including problem difficulty, information gain, and information flow. For the latter, we interpret the unfaithful CoT issue by conducting a joint analysis of the information interaction among the question, CoT, and answer. The result demonstrates that, when the LLM predicts answers, it can recall correct information missing in the CoT from the question, leading to the problem. Finally, we propose a novel algorithm to mitigate this issue, in which we recall extra information from the question to enhance the CoT generation and evaluate CoTs based on their information gain. Extensive experiments demonstrate that our approach enhances both the faithfulness and effectiveness of CoT.
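The abstract mentions evaluating CoTs by their information gain but does not define the quantity here. As a minimal sketch, assuming information gain means the increase in the log-probability of the answer once the CoT is appended to the question, candidate CoTs could be ranked as follows. The names `information_gain`, `select_cot`, and `toy_answer_prob` are hypothetical, and the toy scorer only stands in for a real model; none of this is taken from the paper.

```python
import math
from typing import Callable, Sequence

# Hypothetical scorer signature: returns P(answer | prompt) under some model.
AnswerProb = Callable[[str, str], float]


def information_gain(answer_prob: AnswerProb, question: str, cot: str, answer: str) -> float:
    """Assumed definition: log P(answer | question + CoT) - log P(answer | question)."""
    base = answer_prob(question, answer)
    with_cot = answer_prob(question + "\n" + cot, answer)
    return math.log(with_cot) - math.log(base)


def select_cot(answer_prob: AnswerProb, question: str, cots: Sequence[str], answer: str) -> str:
    """Return the candidate CoT with the highest assumed information gain."""
    return max(cots, key=lambda c: information_gain(answer_prob, question, c, answer))


def toy_answer_prob(prompt: str, answer: str) -> float:
    """Toy stand-in for a model: probability rises with each useful step found in the prompt."""
    useful_steps = ("3 + 2", "= 5")
    found = sum(step in prompt for step in useful_steps)
    return 0.2 + 0.3 * found


question = "Tom has 3 apples and buys 2 more. How many apples does he have?"
helpful_cot = "3 + 2 = 5"
unhelpful_cot = "Tom likes apples."

best = select_cot(toy_answer_prob, question, [unhelpful_cot, helpful_cot], "5")
print(best)
```

Under this assumed definition, a CoT that adds nothing beyond the question scores a gain of zero, so it is never preferred over one that makes the answer more likely.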

