Is Inference Mediated by Distinct Semantic Structures in LLMs? A Mechanistic Interpretation
For mechanistic interpretability researchers, this work shows that LLMs represent not only labels but also, in part, the semantic operations producing them, suggesting that analysis should target operations rather than predicted labels alone.
The paper investigates whether LLMs encode semantic operations (e.g., negation, addition) in addition to label information in natural language inference. Using SVD and activation steering, the authors find that transformation effects are decodable with 84.8-99% accuracy and causally influence predictions, though steerability varies across models.
Predicting a label correctly does not necessarily require representing the operation that produces it. Transformer representations are known to carry label-level information, but whether they encode semantic operations producing those labels is unclear. We investigate this in Natural Language Inference using controlled premise-hypothesis pairs that differ by a single semantic transformation. Using layer-wise activations, we estimate operation-level subspaces via SVD and test their causal relevance through activation steering in four open-weight decoder models. Transformation effects are decodable with $84.8$-$99\%$ accuracy and occupy partially distinct but overlapping subspaces, exceeding random-subspace baselines. Steering experiments show that these directions causally influence predictions, though steerability varies across models; cross-operation steering further reveals structured interference and a dissociation between subspace selectivity and cross-operation independence. These findings indicate that the models encode not only that a hypothesis relates to a premise but also, in part, how it does so, implying that mechanistic analysis and control should operate at the level of semantic operations rather than predicted labels alone.
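The pipeline described above (SVD-based subspace estimation, a random-subspace baseline, subspace overlap, and activation steering) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the data are synthetic, and the function names (`operation_subspace`, `captured_energy`, `random_subspace`, `subspace_overlap`, `steer`), the use of hypothesis-minus-premise activation differences, the uncentered SVD, and the additive steering rule are all assumptions made for illustration.

```python
import numpy as np

def operation_subspace(deltas, k):
    """Top-k right singular vectors of the (n, d) activation differences.
    Assumption: the SVD is taken on uncentered differences."""
    _, _, vt = np.linalg.svd(deltas, full_matrices=False)
    return vt[:k]  # (k, d), rows are orthonormal

def captured_energy(deltas, basis):
    """Fraction of squared norm of `deltas` lying inside the subspace."""
    proj = deltas @ basis.T
    return float(np.sum(proj ** 2) / np.sum(deltas ** 2))

def random_subspace(d, k, rng):
    """Random k-dimensional orthonormal basis in R^d, used as a baseline."""
    q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return q.T  # (k, d)

def subspace_overlap(a, b):
    """Mean squared cosine of the principal angles between two subspaces
    given as orthonormal row bases (1 = identical, 0 = orthogonal)."""
    return float(np.sum((a @ b.T) ** 2) / a.shape[0])

def steer(hidden, direction, alpha):
    """Additive steering: shift a hidden state along a unit direction."""
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

# Synthetic stand-in for layer activations (hypothetical sizes).
rng = np.random.default_rng(0)
d, n, k = 64, 200, 3
true_basis = random_subspace(d, k, rng)
# Differences between hypothesis and premise activations for one operation:
# a low-rank signal plus small isotropic noise.
deltas = rng.standard_normal((n, k)) @ true_basis \
    + 0.05 * rng.standard_normal((n, d))

basis = operation_subspace(deltas, k)
learned = captured_energy(deltas, basis)
baseline = captured_energy(deltas, random_subspace(d, k, rng))
overlap = subspace_overlap(basis, true_basis)

hidden = rng.standard_normal(d)
steered = steer(hidden, basis[0], alpha=2.0)
shift = float((steered - hidden) @ basis[0])  # displacement along the direction

print(f"learned={learned:.3f} baseline={baseline:.3f} overlap={overlap:.3f}")
```

In a real experiment, `deltas` would come from a model's layer-wise activations on premise-hypothesis pairs, and the effect of `steer` would be measured on the model's predicted label rather than on the hidden state itself. Computing `subspace_overlap` between bases estimated for two different operations gives one simple way to quantify how far their subspaces coincide.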