CL · Nov 19, 2025

What Really Counts? Examining Step and Token Level Attribution in Multilingual CoT Reasoning

Jeremias Ferrao, Ezgi Basar, Khondoker Ittehadul Islam, Mahrokh Hassani

arXiv:2511.15886v1 · 1 citation · h-index: 1

Originality: Synthesis-oriented

AI Analysis

The paper addresses interpretability and robustness issues in multilingual reasoning, which makes it relevant to AI researchers, but its contribution is incremental because it builds on existing attribution methods and benchmarks.

The study examines attribution patterns in multilingual Chain-of-Thought reasoning. It finds that attribution scores concentrate excessively on the final reasoning step, particularly in incorrect generations; that structured prompting improves accuracy mainly for high-resource Latin-script languages; and that controlled perturbations (negation and distractor sentences) reduce both model accuracy and attribution coherence.

This study investigates the attribution patterns underlying Chain-of-Thought (CoT) reasoning in multilingual LLMs. While prior work demonstrates that CoT prompting improves task performance, there are concerns about the faithfulness and interpretability of the generated reasoning chains. To assess these properties across languages, we applied two complementary attribution methods, ContextCite for step-level attribution and Inseq for token-level attribution, to the Qwen2.5 1.5B-Instruct model on the MGSM benchmark. Our experiments yield three key findings: (1) attribution scores excessively emphasize the final reasoning step, particularly in incorrect generations; (2) structured CoT prompting significantly improves accuracy primarily for high-resource Latin-script languages; and (3) controlled perturbations via negation and distractor sentences reduce model accuracy and attribution coherence. These findings highlight the limitations of CoT prompting, particularly in terms of multilingual robustness and interpretive transparency.
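
As an illustration of finding (1), the following minimal sketch shows one way the emphasis on the final reasoning step could be quantified once step-level attribution scores are available (for example, from a tool such as ContextCite). The function name `final_step_share` and the choice to normalize by absolute attribution mass are assumptions made here for illustration; the abstract does not specify how the authors measured this.

```python
from typing import Sequence


def final_step_share(step_scores: Sequence[float]) -> float:
    """Return the fraction of absolute attribution mass on the last step.

    `step_scores` holds one attribution score per reasoning step, in order.
    This is a hypothetical helper, not the metric used in the paper.
    """
    if not step_scores:
        raise ValueError("need at least one step score")
    # Use magnitudes so negative attributions do not cancel positive ones.
    magnitudes = [abs(s) for s in step_scores]
    total = sum(magnitudes)
    if total == 0:
        # No attribution mass at all: report zero rather than divide by zero.
        return 0.0
    return magnitudes[-1] / total


if __name__ == "__main__":
    # Made-up scores for a three-step chain; half the mass is on the last step.
    print(final_step_share([1.0, 1.0, 2.0]))
```

Comparing the average of this share between correct and incorrect generations would be one way to check whether the final step is weighted more heavily when the model answers incorrectly.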

