Unveiling Narrative Reasoning Limits of Large Language Models with Trope in Movie Synopses
This work examines the limits of LLMs in abstract narrative reasoning, a topic relevant to AI and NLP researchers. Its contribution is incremental in that it builds on existing CoT methods.
The study assesses narrative reasoning in large language models using tropes in movie synopses. It finds low performance, shows that chain-of-thought prompting can cause hallucinations, and introduces a trope-wise querying approach that boosts the F1 score by 11.8 points.
Large language models (LLMs) equipped with chain-of-thought (CoT) prompting have shown significant multi-step reasoning capabilities on factual content such as mathematics, commonsense, and logic. However, their performance in narrative reasoning, which demands greater abstraction capabilities, remains unexplored. This study uses tropes in movie synopses to assess the abstract reasoning abilities of state-of-the-art LLMs and uncovers their low performance. To address this, we introduce a trope-wise querying approach that boosts the F1 score by 11.8 points. Moreover, while prior studies suggest that CoT enhances multi-step reasoning, this study shows that CoT can cause hallucinations in narrative content, reducing GPT-4's performance. We also introduce an Adversarial Injection method that embeds trope-related text tokens into movie synopses that do not explicitly contain those tropes, revealing CoT's heightened sensitivity to such injections. Our comprehensive analysis provides insights for future research directions.
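As a rough illustration of the trope-wise querying idea mentioned above, the sketch below asks one yes/no question per trope and scores the resulting set with F1. It is not the authors' implementation: the prompt wording, the function names (`build_trope_prompt`, `trope_wise_predict`, `f1_score`), the example trope names, and the keyword-based `toy_ask` stand-in for an LLM call are all assumptions made for illustration.

```python
from typing import Callable, List, Set


def build_trope_prompt(synopsis: str, trope: str) -> str:
    """Build a single yes/no question about one trope (hypothetical wording)."""
    return (
        f"Synopsis:\n{synopsis}\n\n"
        f"Does this synopsis contain the trope '{trope}'? Answer yes or no."
    )


def trope_wise_predict(
    synopsis: str,
    tropes: List[str],
    ask: Callable[[str], str],
) -> Set[str]:
    """Query the model once per trope and keep the tropes answered 'yes'."""
    predicted = set()
    for trope in tropes:
        answer = ask(build_trope_prompt(synopsis, trope))
        if answer.strip().lower().startswith("yes"):
            predicted.add(trope)
    return predicted


def f1_score(predicted: Set[str], gold: Set[str]) -> float:
    """F1 between the predicted and gold trope sets for one synopsis."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def toy_ask(prompt: str) -> str:
    """Stand-in for an LLM call (assumption): 'yes' if a cue word appears."""
    cues = {"Big Bad": "villain", "Heroic Sacrifice": "sacrifices"}
    for trope, cue in cues.items():
        if f"'{trope}'" in prompt and cue in prompt:
            return "yes"
    return "no"


if __name__ == "__main__":
    synopsis = "The villain rules the city until the hero sacrifices himself."
    tropes = ["Big Bad", "Heroic Sacrifice", "Love Triangle"]
    predicted = trope_wise_predict(synopsis, tropes, toy_ask)
    print(sorted(predicted), f1_score(predicted, {"Big Bad", "Love Triangle"}))
```

In practice `toy_ask` would be replaced by a call to the model under evaluation. Passing the model call in as a function keeps the per-trope loop and the scoring independent of any particular API.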