AI · Nov 9, 2025

Chasing Consistency: Quantifying and Optimizing Human-Model Alignment in Chain-of-Thought Reasoning

arXiv:2511.06168v1 · 2 citations · h-index: 4
Originality: Incremental advance
AI Analysis

This work is aimed at AI researchers and practitioners who want to improve human-model alignment in Chain-of-Thought reasoning. It is an incremental advance built around specific optimizations.

The paper evaluates and optimizes reasoning consistency in Large Language Models. It introduces the Alignment Score, a metric that quantifies semantic alignment between model-generated and human-written reasoning chains, finds that 2-hop chains achieve the highest scores, and proposes a method that improves scores on longer chains by an average of 29.84%.

This paper presents a framework for evaluating and optimizing reasoning consistency in Large Language Models (LLMs) via a new metric, the Alignment Score, which quantifies the semantic alignment between model-generated reasoning chains and human-written reference chains in Chain-of-Thought (CoT) reasoning. Empirically, we find that 2-hop reasoning chains achieve the highest Alignment Score. To explain this phenomenon, we define four key error types: logical disconnection, thematic shift, redundant reasoning, and causal reversal, and show how each contributes to the degradation of the Alignment Score. Building on this analysis, we further propose Semantic Consistency Optimization Sampling (SCOS), a method that samples and favors chains with minimal alignment errors, significantly improving Alignment Scores by an average of 29.84% with longer reasoning chains, such as in 3-hop tasks.
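The abstract does not give the formula for the Alignment Score or the sampling procedure behind SCOS, so the following is only a minimal sketch of the general idea: score each candidate chain against a human reference and keep the best-scoring one. Everything in it is an assumption made for illustration, not the authors' method: the token-overlap (Jaccard) similarity stands in for whatever semantic similarity the paper uses, the position-wise step matching is a simplification, and the names `jaccard`, `alignment_score`, and `select_most_aligned` are hypothetical.

```python
from typing import Callable, List, Sequence


def jaccard(a: str, b: str) -> float:
    """Stand-in similarity (assumption): token-set overlap of two steps."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def alignment_score(
    model_chain: Sequence[str],
    human_chain: Sequence[str],
    sim: Callable[[str, str], float] = jaccard,
) -> float:
    """Hypothetical score: mean similarity of steps matched by position.

    Dividing by the longer chain's length means missing or extra steps
    contribute zero, so chains with the wrong number of hops score lower.
    """
    n = max(len(model_chain), len(human_chain))
    if n == 0:
        return 0.0
    total = sum(sim(m, h) for m, h in zip(model_chain, human_chain))
    return total / n


def select_most_aligned(
    candidates: Sequence[List[str]], human_chain: Sequence[str]
) -> List[str]:
    """Keep the sampled chain with the highest score (illustrative only)."""
    return max(candidates, key=lambda c: alignment_score(c, human_chain))


if __name__ == "__main__":
    human = ["paris is the capital of france", "france is in europe"]
    sampled = [
        ["paris is a city"],                                      # too short
        ["paris is the capital of france", "france is in europe"],
    ]
    best = select_most_aligned(sampled, human)
    print(best, alignment_score(best, human))
```

Swapping `jaccard` for an embedding-based similarity would bring the sketch closer to a semantic measure, but which similarity the paper actually uses is not stated here.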

