Towards Better Understanding of Program-of-Thought Reasoning in Cross-Lingual and Multilingual Environments
This paper addresses the problem of improving reasoning in non-English languages for AI systems, though the contribution is incremental because it builds on existing prompting methods.
The paper tackles the challenge of multilingual reasoning in large language models by evaluating Program-of-Thought (PoT) prompting. It shows that PoT fine-tuning substantially improves performance over Chain-of-Thought (CoT) fine-tuned models, and it finds a strong correlation between reasoning quality and answer accuracy.
Multi-step reasoning is essential for large language models (LLMs), yet multilingual performance remains challenging. While Chain-of-Thought (CoT) prompting improves reasoning, it struggles with non-English languages due to the entanglement of reasoning and execution. Program-of-Thought (PoT) prompting separates reasoning from execution, offering a promising alternative but shifting the challenge to generating programs from non-English questions. We propose a framework to evaluate PoT by separating multilingual reasoning from code execution to examine (i) the impact of fine-tuning on question-reasoning alignment and (ii) how reasoning quality affects answer correctness. Our findings demonstrate that PoT fine-tuning substantially enhances multilingual reasoning, outperforming CoT fine-tuned models. We further demonstrate a strong correlation between reasoning quality (measured through code quality) and answer accuracy, highlighting the potential of code quality as a test-time heuristic for improving performance.
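To make the separation of reasoning from execution concrete, the following is a minimal sketch and not the paper's implementation. It assumes a model has already produced several candidate programs for a Spanish question; the candidate strings, the helper names `run_program` and `quality_score`, and the convention that each program stores its result in a variable called `answer` are all hypothetical. The quality score here is a deliberately crude stand-in (does the program parse, does it run and yield an answer) for whatever code-quality measure is actually used.

```python
import ast

# Hypothetical non-English question the programs are meant to solve:
# "María tiene 3 cajas con 4 manzanas cada una y regala 5. ¿Cuántas le quedan?"

# Hypothetical model outputs; in PoT the model writes the reasoning as code
# and leaves the arithmetic to the interpreter.
candidates = [
    "answer = apples - 5\n",                     # uses an undefined variable
    "apples = 3 * 4\nanswer = apples -\n",       # does not parse
    "apples = 3 * 4\nanswer = apples - 5\n",     # well-formed program
]

def run_program(program: str):
    """Execute a candidate program and return its `answer`, or None on any failure."""
    namespace = {}
    try:
        # Builtins are emptied only to keep the toy example contained;
        # this is not a real sandbox.
        exec(program, {"__builtins__": {}}, namespace)
    except Exception:
        return None
    return namespace.get("answer")

def quality_score(program: str) -> int:
    """Toy code-quality proxy (an assumption): 1 point for parsing, 1 for producing an answer."""
    try:
        ast.parse(program)
    except SyntaxError:
        return 0
    return 2 if run_program(program) is not None else 1

# Test-time heuristic: keep the candidate with the highest quality score.
best = max(candidates, key=quality_score)
result = run_program(best)
print(result)
```

Under these assumptions, the malformed candidates receive lower scores, so the selection step picks the program that both parses and executes. A real setup would replace the toy score with a proper code-quality metric and run generated code in an isolated environment.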