Internal Chain-of-Thought: Empirical Evidence for Layer-wise Subtask Scheduling in LLMs
The work improves transparency for LLM researchers by revealing internal planning mechanisms and could enable fine-grained control, although the contribution is incremental because it builds on existing interpretability methods.
The study examines how large language models (LLMs) decompose and execute composite tasks internally. It shows that they process subtasks sequentially, layer by layer, with empirical support from a benchmark of 15 two-step composite tasks and a replication on the TRACE benchmark.
We show that large language models (LLMs) exhibit an $\textit{internal chain-of-thought}$: they sequentially decompose and execute composite tasks layer-by-layer. Two claims ground our study: (i) distinct subtasks are learned at different network depths, and (ii) these subtasks are executed sequentially across layers. On a benchmark of 15 two-step composite tasks, we employ layer-from context-masking and propose a novel cross-task patching method, confirming (i). To examine claim (ii), we apply LogitLens to decode hidden states, revealing a consistent layerwise execution pattern. We further replicate our analysis on the real-world $\text{TRACE}$ benchmark, observing the same stepwise dynamics. Together, our results enhance LLM transparency by showing their capacity to internally plan and execute subtasks (or instructions), opening avenues for fine-grained, instruction-level activation steering.
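The LogitLens step mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the general logit-lens recipe: normalize each layer's hidden state, project it through the unembedding matrix, and read off the top token. The vocabulary, the identity unembedding matrix, and the hidden-state values are hypothetical toy data chosen so that an intermediate answer surfaces before the final one, and the `layer_norm` and `logit_lens` helpers are illustrative names.

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    """Normalize a hidden vector to zero mean and unit variance (no learned scale or bias)."""
    return (h - h.mean()) / np.sqrt(h.var() + eps)

def logit_lens(hidden_states, unembed):
    """Return the top-1 token id decoded from each layer's hidden state.

    hidden_states: array of shape (num_layers, d_model)
    unembed:       array of shape (d_model, vocab_size)
    """
    top_tokens = []
    for h in hidden_states:
        logits = layer_norm(h) @ unembed  # project into vocabulary space
        top_tokens.append(int(np.argmax(logits)))
    return top_tokens

# Hypothetical toy data: a two-step task "translate to French, then uppercase".
vocab = ["cat", "chat", "CHAT", "dog"]
unembed = np.eye(4)  # assumption: each hidden dimension maps to exactly one token
hidden_states = np.array([
    [1.0, 0.2, 0.1, 0.9],  # early layer: still close to the input token
    [0.3, 1.5, 0.2, 0.1],  # middle layer: first subtask result ("chat")
    [0.1, 1.0, 1.2, 0.0],  # later layer: second subtask starts to dominate
    [0.0, 0.3, 2.0, 0.1],  # final layer: composite answer ("CHAT")
])

decoded = [vocab[i] for i in logit_lens(hidden_states, unembed)]
for layer, token in enumerate(decoded):
    print(f"layer {layer}: {token}")
```

In a real model, the hidden states would come from the residual stream at each layer and the projection would use the model's own final normalization and unembedding weights. The toy values here only show what a stepwise decoding pattern looks like.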