It Depends: Resolving Referential Ambiguity in Minimal Contexts with Commonsense Knowledge
This work addresses the problem of LLMs failing to handle ambiguous references effectively in conversations. The contribution is incremental, as it builds on existing fine-tuning methods.
The study investigates whether Large Language Models (LLMs) can use commonsense knowledge to resolve referential ambiguity in conversations. It finds that current models struggle: they tend either to commit to a single interpretation or to cover all possible references, and performance worsens under simplification prompts. Fine-tuning Llama-3.1-8B with Direct Preference Optimization improved ambiguity resolution across request types.
Ambiguous words or underspecified references require interlocutors to resolve them, often by relying on shared context and commonsense knowledge. Therefore, we systematically investigate whether Large Language Models (LLMs) can leverage commonsense to resolve referential ambiguity in multi-turn conversations and analyze their behavior when ambiguity persists. Further, we study how requests for simplified language affect this capacity. Using a novel multilingual evaluation dataset, we test DeepSeek v3, GPT-4o, Qwen3-32B, GPT-4o-mini, and Llama-3.1-8B via LLM-as-Judge and human annotations. Our findings indicate that current LLMs struggle to resolve ambiguity effectively: they tend to commit to a single interpretation or cover all possible references, rather than hedging or seeking clarification. This limitation becomes more pronounced under simplification prompts, which drastically reduce the use of commonsense reasoning and diverse response strategies. Fine-tuning Llama-3.1-8B with Direct Preference Optimization substantially improves ambiguity resolution across all request types. These results underscore the need for advanced fine-tuning to improve LLMs' handling of ambiguity and to ensure robust performance across diverse communication styles.
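As a rough illustration of the fine-tuning objective named above, the following is a minimal sketch of the standard per-pair Direct Preference Optimization loss. It is not the authors' training code. The function name `dpo_loss`, the default `beta=0.1`, and the framing of a "chosen" response (for example, one that hedges or asks for clarification) versus a "rejected" response (for example, one that commits to a single referent) are assumptions made for illustration; the abstract does not specify how the preference pairs were built.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss (hypothetical helper, illustrative only).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    Loss = -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) equals log(1 + exp(-x)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Assumed example values: the policy already prefers the chosen response
# more strongly than the reference does, so the loss falls below log(2).
loss = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

When the policy and the reference model agree exactly, the margin is zero and the loss equals log 2; the loss shrinks as the policy raises the chosen response's likelihood relative to the rejected one.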