Enhancing LLM Metacognition via Cognitive Pairwise Training
For LLM developers, CPT offers a mid-training method that improves reasoning reliability and uncertainty calibration without overfitting to abstention behavior.
The authors propose Cognitive Pairwise Training (CPT), which improves LLM metacognition by training the model to distinguish trustworthy from flawed reasoning traces. At 14B scale, CPT+RL outperforms the standard SFT+RL pipeline by +2.2 math-average points and +5.2 abstention-F1 points.
Reinforcement learning with verifiable rewards (RLVR) has become central to LLM reasoning, but its outcome-level rewards can make models more willing to give confident answers when evidence or reasoning is unreliable. Existing SFT or RL methods mainly teach LLMs to refuse or express uncertainty at the response level, which can overfit abstention behavior rather than improve reasoning reliability. To address this limitation, we propose Cognitive Pairwise Training (CPT), a cognitive mid-training alignment stage that turns pairwise comparisons over reasoning traces into a reusable alignment signal. By learning to distinguish trustworthy from flawed reasoning, CPT encourages the model to internalize a reasoning-quality discrimination boundary rather than memorize surface refusal patterns. Across five model scales and three model families, CPT improves the reasoning-metacognition trade-off. At 14B, CPT+RL outperforms the standard SFT+RL pipeline by +2.2 math-average points and +5.2 abstention-F1 points. Further analyses show that CPT improves trace quality and exhibits strong robustness and scalability across evaluation and training settings. Code and models are released at https://github.com/Tsinghua-dhy/CPT.
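
The abstract does not state the training objective, so the following is only a generic illustration of how a pairwise comparison between a trustworthy and a flawed reasoning trace can be turned into a scalar training signal. It uses a standard Bradley-Terry style logistic loss, which is an assumption and not necessarily what CPT uses; the function names and the idea of a scalar per-trace score are hypothetical.

```python
import math


def pairwise_preference_loss(score_trusted: float, score_flawed: float) -> float:
    """Logistic pairwise loss: -log(sigmoid(score_trusted - score_flawed)).

    Assumption: each reasoning trace has already been mapped to a scalar
    score (for example, a model log-likelihood). This is a generic
    Bradley-Terry style objective, not the CPT objective from the paper.
    """
    margin = score_trusted - score_flawed
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_pairwise_loss(pairs: list[tuple[float, float]]) -> float:
    """Average the pairwise loss over (trusted_score, flawed_score) pairs."""
    return sum(pairwise_preference_loss(t, f) for t, f in pairs) / len(pairs)
```

Under this sketch the loss shrinks as the trustworthy trace is scored further above the flawed one, and equals log 2 when the two scores are tied, so minimizing it pushes the model toward a margin between the two kinds of trace.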