Memorization or Reasoning? Exploring the Idiom Understanding of LLMs
This work addresses idiom understanding in LLMs, a persistent challenge for natural language processing applications. Its contribution is incremental: it builds on existing research by adding a new dataset and analysis.
The study investigates how large language models (LLMs) process idioms. It introduces MIDAS, a large-scale multilingual dataset, and finds that LLMs use a hybrid approach combining memorization and reasoning, particularly for compositional idioms.
Idioms have long posed a challenge due to their unique linguistic properties, which set them apart from other common expressions. While recent studies have leveraged large language models (LLMs) to handle idioms across various tasks, e.g., idiom-containing sentence generation and idiomatic machine translation, little is known about the underlying mechanisms of idiom processing in LLMs, particularly in multilingual settings. To this end, we introduce MIDAS, a new large-scale dataset of idioms in six languages, each paired with its corresponding meaning. Leveraging this resource, we conduct a comprehensive evaluation of LLMs' idiom processing ability, identifying key factors that influence their performance. Our findings suggest that LLMs do not rely on memorization alone but adopt a hybrid approach that integrates contextual cues and reasoning, especially when processing compositional idioms. This implies that idiom understanding in LLMs emerges from an interplay between internal knowledge retrieval and reasoning-based inference.