Emergence and Localisation of Semantic Role Circuits in LLMs
This work is aimed at researchers in interpretability and AI safety and addresses the insufficient characterisation of internal mechanisms in LLMs; its contribution is incremental, refining existing analysis methods.
The authors study how large language models implement semantic roles internally and find that LLMs form compact, causally isolated circuits for semantic structure, with most attribution concentrated in a small number of nodes and partial transfer across scales.
Although large language models display semantic competence, the internal mechanisms that ground abstract semantic structure remain insufficiently characterised. We propose a method integrating role-cross minimal pairs, temporal emergence analysis, and cross-model comparison to study how LLMs implement semantic roles. Our analysis uncovers: (i) highly concentrated circuits (89-94% of attribution within 28 nodes); (ii) gradual structural refinement rather than phase transitions, with larger models sometimes bypassing localised circuits; and (iii) moderate cross-scale conservation (24-59% component overlap) alongside high spectral similarity. These findings suggest that LLMs form compact, causally isolated mechanisms for abstract semantic structure, and that these mechanisms transfer partially across scales and architectures.
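
To make the two headline quantities concrete, the following is a minimal sketch of how an attribution share within the top k nodes and a component overlap between two models might be computed. It is an illustration under stated assumptions, not the authors' procedure: the abstract does not define either metric, so the sketch assumes attribution is summarised as one score per node, that concentration is the share of total absolute attribution held by the top k nodes, and that overlap is the Jaccard index of the two top-k node sets. The function names, node labels, and scores are all hypothetical.

```python
from typing import Dict, Set


def top_k_attribution_share(scores: Dict[str, float], k: int) -> float:
    """Share of total absolute attribution carried by the k largest nodes.

    Assumption: one scalar attribution score per node (hypothetical format).
    """
    magnitudes = sorted((abs(v) for v in scores.values()), reverse=True)
    total = sum(magnitudes)
    if total == 0:
        return 0.0
    return sum(magnitudes[:k]) / total


def top_k_nodes(scores: Dict[str, float], k: int) -> Set[str]:
    """Names of the k nodes with the largest absolute attribution."""
    ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
    return set(ranked[:k])


def jaccard_overlap(a: Set[str], b: Set[str]) -> float:
    """Jaccard index |a & b| / |a | b|, assumed here as the overlap measure."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


if __name__ == "__main__":
    # Made-up scores for two hypothetical models; not data from the paper.
    small = {"L0.H1": 5.0, "L1.H3": -3.0, "L2.MLP": 1.0, "L3.H0": 1.0}
    large = {"L0.H1": 4.0, "L2.MLP": 3.0, "L1.H3": 0.5, "L3.H0": 0.5}

    print(top_k_attribution_share(small, k=2))
    print(jaccard_overlap(top_k_nodes(small, 2), top_k_nodes(large, 2)))
```

Under these assumptions, a result such as "89-94% attribution within 28 nodes" would correspond to `top_k_attribution_share(scores, 28)` returning a value between 0.89 and 0.94. The spectral similarity mentioned in the abstract is not covered by this sketch.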