Mechanisms vs. Outcomes: Probing for Syntax Fails to Explain Performance on Targeted Syntactic Evaluations
This work addresses a gap in interpretability research for AI researchers, showing that probing for syntax is not a reliable indicator of model performance, which is incremental as it builds on existing probing methods.
The study tackled the problem of whether syntactic features extracted via probing from large language models predict their performance on downstream syntactic tasks, finding that probing accuracy does not reliably predict outcomes across English linguistic phenomena.
Large Language Models (LLMs) exhibit a robust mastery of syntax when processing and generating text. While this suggests internalized understanding of hierarchical syntax and dependency relations, the precise mechanism by which they represent syntactic structure is an open area within interpretability research. Probing provides one way to identify the mechanism of syntax being linearly encoded in activations, however, no comprehensive study has yet established whether a model's probing accuracy reliably predicts its downstream syntactic performance. Adopting a "mechanisms vs. outcomes" framework, we evaluate 32 open-weight transformer models and find that syntactic features extracted via probing fail to predict outcomes of targeted syntax evaluations across English linguistic phenomena. Our results highlight a substantial disconnect between latent syntactic representations found via probing and observable syntactic behaviors in downstream tasks.