Cultural Binding Heads in Language Models
For developers of culturally aware LLMs, this work provides a mechanistic understanding of cultural binding and a steering method to improve cultural differentiation without harming general performance.
The paper identifies 2-3 mid-layer attention heads per model that causally contribute to cultural binding in LLMs, with knockout reducing binding strength by 9-23%. Steering at generation increases cultural differentiation accuracy by 1-3 percentage points while preserving neutral reasoning, revealing that models know 3-5 times more than they act upon.
LLMs often default to equal treatment across cultural groups even when context warrants differentiation, a failure of difference awareness. Using mechanistic interpretability and a factorial design on the N4 cultural appropriation benchmark from Wang et al. (2025), we identify 2-3 mid-layer attention heads per model that contribute causally to cultural binding, the process of associating cultural items with the appropriate identity, across eight models (four architectures, each in base and instruct variants). Knocking out the identity-to-item edges on these heads lowers binding strength by 9-23%. The identified heads transfer from instruct to base models, suggesting that cultural binding emerges during pre-training. $\alpha$-scaling shows a graded dose-response, and moderate amplification steering at generation ($\alpha = 2$-$3$) increases cultural differentiation accuracy by 1-3 pp while leaving neutral reasoning mostly intact. A knowledge probing task shows that models know 3-5 times more than they act upon, indicating that the bottleneck lies in routing rather than in knowledge.
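To make the edge knockout and $\alpha$-scaling described above concrete, the following is a minimal toy sketch of a single causal attention head in plain Python. It is not the paper's implementation. The function names (`attention_weights`, `scale_edges`, `head_output`), the token positions, the random inputs, and the choice to scale post-softmax weights without renormalizing are all assumptions made for illustration. An "identity-to-item edge" is modeled here as the attention weight with the item token as query and the identity token as key; $\alpha = 0$ corresponds to knockout and $\alpha > 1$ to amplification.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(q, k):
    """Causal attention weights for one head; q and k are lists of vectors."""
    d = len(q[0])
    weights = []
    for i, qi in enumerate(q):
        # Position i may only attend to positions 0..i (causal mask).
        scores = [
            sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        # Pad masked (future) positions with zero weight.
        weights.append(softmax(scores) + [0.0] * (len(q) - i - 1))
    return weights


def scale_edges(weights, edges, alpha):
    """Copy the weights and multiply each (query_pos, key_pos) edge by alpha.

    Assumption: scaling is applied after the softmax and rows are not
    renormalized. alpha = 0 knocks the edge out; alpha > 1 amplifies it.
    """
    scaled = [row[:] for row in weights]
    for query_pos, key_pos in edges:
        scaled[query_pos][key_pos] *= alpha
    return scaled


def head_output(weights, v):
    """Weighted sum of value vectors for every query position."""
    d = len(v[0])
    return [
        [sum(w * vj[c] for w, vj in zip(row, v)) for c in range(d)]
        for row in weights
    ]


# Hypothetical toy setup: 6 tokens, head dimension 8, random q/k/v.
random.seed(0)
T, D = 6, 8
q, k, v = (
    [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(T)]
    for _ in range(3)
)

identity_pos, item_pos = 1, 4          # hypothetical token positions
edges = [(item_pos, identity_pos)]     # item token attends to identity token

base_w = attention_weights(q, k)
ko_w = scale_edges(base_w, edges, alpha=0.0)    # knockout
amp_w = scale_edges(base_w, edges, alpha=2.5)   # moderate amplification

base_out = head_output(base_w, v)
ko_out = head_output(ko_w, v)
amp_out = head_output(amp_w, v)
```

Only the output at the item position changes under either intervention; every other position is left untouched, which is what makes the edit specific to one edge. In a real model the same operation would be applied inside the selected heads during the forward pass, and the effect on a binding-strength metric would be measured downstream.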