CL · Apr 1

From Early Encoding to Late Suppression: Interpreting LLMs on Character Counting Tasks

arXiv:2604.00778 · 76.81 citations
Predicted impact top 79% in CL · last 90 days
Originality: Incremental advance
AI Analysis

This work addresses a fundamental weakness of LLMs on tasks that require reliable symbolic reasoning. It is an incremental advance: it builds on known limitations but provides new mechanistic insights.

The paper asks why large language models (LLMs) fail on simple symbolic tasks such as character counting. It finds that models often compute the correct answer internally, but a small set of later-layer components (negative circuits) suppresses this information, producing incorrect outputs. These failures therefore stem from structured interference within the model's computation graph rather than from missing representations, and they can persist or worsen under scaling and instruction tuning.

Large language models (LLMs) exhibit failures on elementary symbolic tasks such as character counting in a word, despite excelling on complex benchmarks. Although this limitation has been noted, the internal reasons remain unclear. We use character counting (e.g., "How many p's are in apple?") as a minimal, controlled probe that isolates token-level reasoning from higher-level confounds. Using this setting, we uncover a consistent phenomenon across modern architectures, including LLaMA, Qwen, and Gemma: models often compute the correct answer internally yet fail to express it at the output layer. Through mechanistic analysis combining probing classifiers, activation patching, logit lens analysis, and attention head tracing, we show that character-level information is encoded in early and mid-layer representations. However, this information is attenuated by a small set of components in later layers, especially the penultimate and final layer MLP. We identify these components as negative circuits: subnetworks that downweight correct signals in favor of higher-probability but incorrect outputs. Our results lead to two contributions. First, we show that symbolic reasoning failures in LLMs are not due to missing representations or insufficient scale, but arise from structured interference within the model's computation graph. This explains why such errors persist and can worsen under scaling and instruction tuning. Second, we provide evidence that LLM forward passes implement a form of competitive decoding, in which correct and incorrect hypotheses coexist and are dynamically reweighted, with final outputs determined by suppression as much as by amplification. These findings carry implications for interpretability and robustness: simple symbolic reasoning exposes weaknesses in modern LLMs, underscoring the need for design strategies that ensure information is both encoded and reliably used.
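
To make the "early encoding, late suppression" picture concrete, here is a toy sketch in Python. It is not the paper's code. It assumes the residual stream is a plain sum of per-component contributions and uses an identity unembedding with no layer norm. Every number is hand-picked for illustration, and the names `logit_lens` and `direct_logit_attribution` are hypothetical helpers. The logit lens decodes the running residual after each component. The attribution step measures how much each component pushes the correct answer above the wrong one; a strongly negative value is what the abstract would call a negative circuit.

```python
import numpy as np

def logit_lens(contributions, unembed):
    """Decode the cumulative residual stream after each component.

    contributions: (n_components, d_model) additive writes to the residual.
    unembed:       (d_model, vocab) projection to token logits.
    Returns logits of shape (n_components, vocab).
    """
    residual = np.cumsum(contributions, axis=0)
    return residual @ unembed

def direct_logit_attribution(contributions, unembed, correct, wrong):
    """Per-component effect on logit[correct] - logit[wrong]."""
    direction = unembed[:, correct] - unembed[:, wrong]
    return contributions @ direction

# Toy answer vocabulary for "How many p's are in apple?" (correct answer: "2").
vocab = ["1", "2", "3"]
correct, wrong = vocab.index("2"), vocab.index("3")

# Assumption: identity unembedding, so each residual dimension is one token logit.
unembed = np.eye(3)

# Hand-picked, hypothetical component outputs (not measured from any model).
names = ["embed", "early_attn", "mid_mlp", "late_attn", "final_mlp"]
contributions = np.array([
    [0.0,  0.1, 0.3],   # embed: weak prior toward the wrong answer
    [0.0,  1.0, 0.3],   # early_attn: writes the correct count
    [0.0,  1.5, 0.5],   # mid_mlp: reinforces the correct count
    [0.0,  0.2, 0.4],   # late_attn: small shift toward the wrong answer
    [0.0, -2.0, 1.0],   # final_mlp: suppresses correct, boosts wrong
])

lens_logits = logit_lens(contributions, unembed)
predictions = [vocab[i] for i in lens_logits.argmax(axis=1)]
dla = direct_logit_attribution(contributions, unembed, correct, wrong)
suppressor = names[int(np.argmin(dla))]

for name, pred, effect in zip(names, predictions, dla):
    print(f"{name:>10}: lens prediction = {pred}, logit-diff effect = {effect:+.1f}")
print("most suppressive component:", suppressor)
```

In this toy setup the intermediate readouts favor the correct count, and only the last component flips the prediction. On a real model, the same analysis would use hidden states captured from a forward pass, the model's own unembedding matrix, and normally the final layer norm before projection.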

