CL · Dec 27, 2025

Evaluating GRPO and DPO for Faithful Chain-of-Thought Reasoning in LLMs

Hadi Mohammadi, Tamas Kozak, Anastasia Giachanou

arXiv:2512.22631v1 · 6.72 citations · h-index: 5

Originality: Incremental advance

AI Analysis

This work addresses reliability issues in CoT-based methods used for AI safety supervision and alignment monitoring, though it appears incremental because it compares existing optimization techniques.

The paper tackles unfaithful chain-of-thought reasoning in large language models, where the stated explanation may not reflect the reasoning the model actually used. It finds that Group Relative Policy Optimization (GRPO) outperforms Direct Preference Optimization (DPO) at improving faithfulness in larger models, with the Qwen2.5-14B-Instruct model achieving the best results across all evaluation metrics.

Chain-of-thought (CoT) reasoning has emerged as a powerful technique for improving the problem-solving capabilities of large language models (LLMs), particularly for tasks requiring multi-step reasoning. However, recent studies show that CoT explanations often fail to reflect the model's actual reasoning process, as models may produce coherent yet misleading justifications or modify answers without acknowledging external cues. Such discrepancies undermine the reliability of CoT-based methods for safety supervision and alignment monitoring, as models can generate plausible but deceptive rationales for incorrect answers. To better understand this limitation, we evaluate two optimization methods, Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO), in their ability to improve CoT faithfulness. Our experiments show that GRPO achieves higher performance than DPO in larger models, with the Qwen2.5-14B-Instruct model attaining the best results across all evaluation metrics. Both approaches exhibit positive correlations between model size and performance, but GRPO shows greater potential for improving faithfulness metrics, albeit with less stable behavior at smaller scales. These results suggest that GRPO offers a promising direction for developing more transparent and trustworthy reasoning in LLMs.
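As a rough illustration of the two objectives compared in the abstract, the following Python sketch shows the core quantity each method optimizes: the group-relative advantage used by GRPO and the per-pair preference loss used by DPO. It is not the authors' implementation. The function names, the scalar "faithfulness" rewards, the use of the population standard deviation, and the value of `beta` are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages (GRPO-style).

    Each reward is standardized against the other completions sampled for
    the same prompt, so no learned value function is needed.
    `rewards` is a hypothetical list of scalar faithfulness scores.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std within the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy and under a
    frozen reference model. The loss is -log(sigmoid(beta * margin)).
    """
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    x = beta * margin
    # Numerically stable -log(sigmoid(x)) = softplus(-x).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


if __name__ == "__main__":
    # Four sampled CoT completions for one prompt, scored 1 if judged
    # faithful and 0 otherwise (hypothetical scores).
    print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
    # Policy prefers the chosen completion more than the reference does.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

In this sketch, GRPO needs only a scalar score per sampled completion, whereas DPO needs explicit chosen/rejected pairs and a reference model. When the policy and reference assign identical log-probabilities, the DPO margin is zero and the loss equals log 2.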
