Tracing the complexity profiles of different linguistic phenomena through the intrinsic dimension of LLM representations
This work provides a method for analyzing linguistic complexity in LLMs, which could aid researchers in understanding model internals, though it is incremental in applying existing ID techniques to new linguistic phenomena.
The study investigated whether the intrinsic dimension (ID) of LLM representations can serve as a marker for different types of linguistic complexity. ID profiles clearly reflect formal complexity (sentences with multiple coordinated or subordinated clauses), with an onset that aligns with a phase of more abstract linguistic processing identified in earlier work. Functional contrasts (right branching vs. center embedding, unambiguous vs. ambiguous relative clause attachment) are also detected, but less markedly and without the same alignment.
We explore the intrinsic dimension (ID) of LLM representations as a marker of linguistic complexity, asking whether ID profiles across LLM layers distinguish formal from functional complexity. We find the formal contrast between sentences with multiple coordinated or subordinated clauses to be reflected in ID differences whose onset aligns with a phase of more abstract linguistic processing independently identified in earlier work. The functional contrasts between sentences characterized by right branching vs. center embedding or unambiguous vs. ambiguous relative clause attachment are also picked up by ID, but in a less marked way, and they do not correlate with the same processing phase. Further experiments using representational similarity and layer ablation confirm the same trends. We conclude that ID is a useful marker of linguistic complexity in LLMs, that it makes it possible to differentiate between different types of complexity, and that it points to similar stages of linguistic processing across disparate LLMs.
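To make the notion of a per-layer ID profile concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not state which ID estimator is used; the sketch assumes the maximum-likelihood form of the TwoNN estimator, which infers ID from the ratio of each point's second- to first-nearest-neighbor distance. The names `two_nn_id` and `id_profile` are hypothetical, and the input is assumed to be one array of shape (sentences, hidden size) per layer.

```python
import numpy as np

def two_nn_id(points):
    """Estimate intrinsic dimension with the TwoNN maximum-likelihood formula.

    Assumption: TwoNN is used purely for illustration; the source text does
    not name its estimator.
    """
    x = np.asarray(points, dtype=float)
    log_mu = []
    for i in range(x.shape[0]):
        dist = np.linalg.norm(x - x[i], axis=1)
        dist[i] = np.inf  # exclude the point itself
        # The two smallest distances: first and second nearest neighbor.
        r1, r2 = np.partition(dist, 1)[:2]
        if r1 > 0:  # skip exact duplicates, where the ratio is undefined
            log_mu.append(np.log(r2 / r1))
    # MLE for the shape of the Pareto-distributed ratio mu = r2 / r1.
    return len(log_mu) / np.sum(log_mu)

def id_profile(layer_reps):
    """Return one ID estimate per layer (hypothetical helper)."""
    return [two_nn_id(reps) for reps in layer_reps]

if __name__ == "__main__":
    # Synthetic check: a 2-D sheet linearly embedded in 30 dimensions.
    rng = np.random.default_rng(0)
    sheet = rng.uniform(size=(800, 2)) @ rng.normal(size=(2, 30))
    print(two_nn_id(sheet))
```

Under these assumptions, comparing complexity conditions would amount to calling `id_profile` once per sentence set (for example, coordinated vs. subordinated clauses) and inspecting the layers at which the two profiles diverge.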