AI · CL · LG · Feb 2

Interpreting and Controlling LLM Reasoning through Integrated Policy Gradient

arXiv:2602.02313v2 · 3 citations · h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses the interpretability and control of LLM reasoning, a topic relevant to both researchers and practitioners. The advance is incremental, since it builds on existing outcome-oriented and sequential-influence-aware principles.

The paper tackles the problem of interpreting and controlling the internal mechanisms of large language models (LLMs) during complex reasoning. It proposes the Integrated Policy Gradient (IPG) framework, which attributes reasoning behaviors to model components and is reported to achieve more precise localization and more reliable modulation of reasoning capabilities.

Large language models (LLMs) demonstrate strong reasoning abilities in solving complex real-world problems. Yet the internal mechanisms driving these reasoning behaviors remain opaque. Existing interpretability approaches targeting reasoning either identify components (e.g., neurons) correlated with particular textual patterns or rely on human-annotated contrastive pairs to derive control vectors. Consequently, current methods struggle to precisely localize complex reasoning mechanisms or to capture the sequential influence of a model's internal workings on its reasoning outputs. In this paper, building on outcome-oriented and sequential-influence-aware principles, we focus on identifying components that contribute sequentially to reasoning behavior, where outcomes accumulate through long-range effects. We propose Integrated Policy Gradient (IPG), a novel framework that attributes reasoning behaviors to a model's internal components by propagating compound outcome-based signals, such as post-reasoning accuracy, backward through model inference trajectories. Empirical evaluations demonstrate that our approach achieves more precise localization and enables reliable modulation of reasoning behaviors (e.g., reasoning capability, reasoning strength) across diverse reasoning models.
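The abstract does not give IPG's equations, so the following is only a toy sketch of one plausible reading of the name: take a REINFORCE-style surrogate (outcome reward times the log-likelihood of the sampled tokens) and attribute it to hidden units by integrating its gradient along a straight path from a zero baseline, as in Integrated Gradients. Everything here is an assumption made for illustration, not the paper's method: the linear readout `W`, the zero baseline, the scalar `reward`, and all function names are hypothetical.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax of a 1-D logit vector."""
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def trajectory_objective(H, W, actions, reward):
    """Outcome reward times the log-likelihood of the sampled tokens.

    H: (T, d) hidden activations, one row per generation step (toy stand-in).
    W: (V, d) hypothetical linear readout from hidden state to token logits.
    """
    return reward * sum(log_softmax(W @ h)[a] for h, a in zip(H, actions))

def objective_grad(H, W, actions, reward):
    """Analytic gradient of the objective w.r.t. every hidden activation."""
    G = np.zeros_like(H)
    for t, (h, a) in enumerate(zip(H, actions)):
        p = np.exp(log_softmax(W @ h))
        onehot = np.zeros_like(p)
        onehot[a] = 1.0
        G[t] = reward * (W.T @ (onehot - p))
    return G

def integrated_attribution(H, W, actions, reward, steps=256):
    """Path-integrate the gradient from a zero baseline to H (midpoint rule).

    Returns one score per hidden unit, summed over all trajectory steps.
    """
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = sum(objective_grad(alpha * H, W, actions, reward)
                   for alpha in alphas) / steps
    return (H * avg_grad).sum(axis=0)

# Toy trajectory: 5 steps, 4 hidden units, vocabulary of 6 tokens.
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))
W = rng.normal(size=(6, 4))
actions = rng.integers(0, 6, size=5)
attr = integrated_attribution(H, W, actions, reward=1.0)
print(attr)
```

Because the path runs from the zero baseline to the observed activations, the per-unit scores sum (up to discretization error) to the difference between the objective at `H` and at the baseline, which is a convenient sanity check for this kind of attribution.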
