Tug-of-war between idioms' figurative and literal interpretations in LLMs
This work provides mechanistic insights into idiom comprehension for researchers in NLP and cognitive science, though the contribution is incremental because it builds on existing causal analysis methods.
The paper examines how large language models handle idioms whose figurative and literal interpretations conflict. Using causal tracing, it identifies three mechanisms in transformers that manage this ambiguity, including early layers that retrieve figurative meanings and parallel pathways that maintain both interpretations.
Idioms present a unique challenge for language models due to their non-compositional figurative interpretations, which often strongly diverge from the idiom's literal interpretation. In this paper, we employ causal tracing to systematically analyze how pretrained causal transformers deal with this ambiguity. We localize three mechanisms: (i) Early sublayers and specific attention heads retrieve an idiom's figurative interpretation, while suppressing its literal interpretation. (ii) When disambiguating context precedes the idiom, the model leverages it from the earliest layer and later layers refine the interpretation if the context conflicts with the retrieved interpretation. (iii) Then, selective, competing pathways carry both interpretations: an intermediate pathway prioritizes the figurative interpretation and a parallel direct route favors the literal interpretation, ensuring that both readings remain available. Our findings provide mechanistic evidence for idiom comprehension in autoregressive transformers.
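To make the causal tracing idea concrete, the following is a minimal sketch of activation patching on a toy model. It is not the paper's setup: the residual stack, the causal averaging used in place of attention, and all names (`forward`, `causal_trace`, `readout`) are hypothetical and chosen only for illustration. The sketch runs a clean input and a corrupted input, copies one clean hidden state into the corrupted run, and reports how much of the clean output score is recovered.

```python
import numpy as np

def forward(x, weights, patch=None):
    """Run a toy causal residual stack on x of shape (T, d).

    patch is None or (layer, position, vector); when given, the hidden
    state at that layer and position is overwritten with the vector.
    Returns the final states and a list of states after every layer.
    """
    h = x.copy()
    T = h.shape[0]
    # Stand-in for attention: position t sees the mean of positions 0..t.
    mix = np.tril(np.ones((T, T)))
    mix /= mix.sum(axis=1, keepdims=True)
    states = []
    for i, W in enumerate(weights):
        h = h + np.tanh(mix @ h @ W)
        if patch is not None and patch[0] == i:
            h[patch[1]] = patch[2]
        states.append(h.copy())
    return h, states

def causal_trace(clean_x, corrupt_x, weights, readout):
    """Return an (L, T) array of normalized restoration effects.

    0 means the patch changed nothing; 1 means the clean score was
    fully restored. Assumes the clean and corrupted scores differ.
    """
    clean_h, clean_states = forward(clean_x, weights)
    corrupt_h, _ = forward(corrupt_x, weights)
    clean_score = clean_h[-1] @ readout
    corrupt_score = corrupt_h[-1] @ readout
    L, T = len(weights), clean_x.shape[0]
    effects = np.zeros((L, T))
    for layer in range(L):
        for pos in range(T):
            patched_h, _ = forward(
                corrupt_x, weights,
                patch=(layer, pos, clean_states[layer][pos]),
            )
            restored = patched_h[-1] @ readout - corrupt_score
            effects[layer, pos] = restored / (clean_score - corrupt_score)
    return effects

# Toy demo: 4 positions, 8 dimensions, 3 layers (all values arbitrary).
rng = np.random.default_rng(0)
T, d, L = 4, 8, 3
weights = [rng.normal(scale=0.5, size=(d, d)) for _ in range(L)]
readout = rng.normal(size=d)
clean_x = rng.normal(size=(T, d))
corrupt_x = clean_x.copy()
corrupt_x[1:3] += rng.normal(size=(2, d))  # noise on the "idiom" positions

effects = causal_trace(clean_x, corrupt_x, weights, readout)
print(np.round(effects, 3))
```

In this sketch, patching the last position at the last layer restores the clean score completely, and patching the uncorrupted first position has no effect because its state is identical in both runs. In a real transformer the same loop would be applied to sublayer outputs or attention heads rather than to whole hidden states of a toy stack.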