TreeCut: A Synthetic Unanswerable Math Word Problem Dataset for LLM Hallucination Evaluation
TreeCut addresses the challenge of assessing how reliably LLMs reason, which matters to researchers and developers, though the contribution is incremental because it centers on one specific dataset creation approach.
The researchers tackled the problem of evaluating LLM hallucinations on unanswerable math word problems by creating TreeCut, a synthetic dataset that systematically generates such problems, which induced hallucinations in models like GPT-4o at rates up to 64%.
Large language models (LLMs) now achieve near-human performance on standard math word problem benchmarks (e.g., GSM8K), yet their true reasoning ability remains disputed. A key concern is that models often produce confident yet unfounded answers to unanswerable problems. We introduce TreeCut, a synthetic dataset that can systematically generate an unlimited number of unanswerable math word problems and their answerable counterparts by representing each question as a tree and removing chosen necessary conditions. Experiments show that TreeCut effectively induces hallucinations in large language models, including GPT-4o and o3-mini, with rates of 64% and 44% in their respective worst-case scenarios under the zero-shot setting. Further analysis shows that deeper or more complex trees, composite item names, and removing a necessary condition near the middle of a path all increase the likelihood of hallucinations, underscoring the persistent challenges LLMs face in identifying unanswerable math problems. The dataset generation code and sample data are available at https://github.com/j-bagel/treecut-math.
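To make the tree-and-cut idea concrete, the following is a minimal Python sketch, not the authors' generator. It assumes a simplified encoding in which every condition either states a value directly or relates a variable linearly to its parent in the tree; the names Condition, path_to_root, solve, and cut are hypothetical and chosen only for illustration. A question is answerable when every condition on the path from the queried variable to the root is present, and removing any one of them leaves the answer undetermined.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Condition:
    """One stated fact: child = coeff * parent + offset.

    A parent of None marks a directly given value (child = offset).
    This linear form is an illustrative assumption, not the paper's format.
    """
    child: str
    parent: Optional[str]
    coeff: int
    offset: int


def path_to_root(conditions: List[Condition], target: str) -> Optional[List[Condition]]:
    """Collect the conditions from target up to a given value, or None if the chain is broken."""
    by_child = {c.child: c for c in conditions}
    path = []
    node: Optional[str] = target
    while node is not None:
        cond = by_child.get(node)
        if cond is None:  # a necessary condition is missing
            return None
        path.append(cond)
        node = cond.parent
    return path


def solve(conditions: List[Condition], target: str) -> Optional[int]:
    """Return the target's value, or None when the problem is unanswerable."""
    path = path_to_root(conditions, target)
    if path is None:
        return None
    value = 0
    for cond in reversed(path):  # start at the given value, walk down to the target
        value = cond.coeff * value + cond.offset
    return value


def cut(conditions: List[Condition], target: str, index: int) -> List[Condition]:
    """Drop the index-th condition on the target's path (0 = nearest the target)."""
    path = path_to_root(conditions, target)
    if path is None:
        raise ValueError("target is already unanswerable")
    removed = path[index]
    return [c for c in conditions if c != removed]


# Hypothetical example problem: the item names and numbers are made up.
conditions = [
    Condition("apples", None, 0, 5),          # there are 5 apples
    Condition("bananas", "apples", 2, 1),     # bananas = 2 * apples + 1
    Condition("cherries", "bananas", 3, -4),  # cherries = 3 * bananas - 4
    Condition("dates", "apples", 1, 2),       # distractor: not on the cherries path
]

answerable = solve(conditions, "cherries")                         # 29
unanswerable = solve(cut(conditions, "cherries", 1), "cherries")   # None
no_distractor = [c for c in conditions if c.child != "dates"]
still_answerable = solve(no_distractor, "cherries")                # 29
```

Here index 1 removes the bananas condition, which sits in the middle of the path from cherries to the given apple count, the cut position the abstract reports as most likely to induce hallucinations. Removing the dates condition, which lies off that path, leaves the question answerable.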