DW-Bench: Benchmarking LLMs on Data Warehouse Graph Topology Reasoning
For LLM practitioners and data warehouse users, this benchmark reveals where LLMs fall short on complex multi-hop reasoning.
DW-Bench evaluates LLMs on graph-topology reasoning over data warehouse schemas with FK and data-lineage edges. Tool-augmented methods outperform static ones but plateau on hard compositional subtypes.
This paper introduces DW-Bench, a benchmark that evaluates large language models (LLMs) on graph-topology reasoning over data warehouse schemas, explicitly integrating both foreign-key (FK) and data-lineage edges. The benchmark comprises 1,046 automatically generated, verifiably correct questions across five schemas. Experiments show that tool-augmented methods substantially outperform static approaches but plateau on hard compositional subtypes.
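To make the setting concrete, the following is a minimal sketch of a schema graph that mixes FK and data-lineage edges, together with a multi-hop question whose answer is computed from the graph itself. It is not the benchmark's actual generator: the table names, the `EDGES` list, and the helpers `build_adjacency`, `shortest_hops`, and `make_question` are all hypothetical and chosen only for illustration.

```python
from collections import deque

# Hypothetical toy schema; table names and edges are illustrative only
# and are not taken from DW-Bench.
# Each edge is (source, target, kind), with kind in {"fk", "lineage"}.
EDGES = [
    ("fact_sales", "dim_customer", "fk"),
    ("fact_sales", "dim_product", "fk"),
    ("dim_product", "dim_category", "fk"),
    ("stg_orders", "fact_sales", "lineage"),
    ("raw_orders", "stg_orders", "lineage"),
]


def build_adjacency(edges, kinds):
    """Build a directed adjacency list restricted to the given edge kinds."""
    adjacency = {}
    for source, target, kind in edges:
        if kind in kinds:
            adjacency.setdefault(source, []).append(target)
    return adjacency


def shortest_hops(edges, start, goal, kinds=("fk", "lineage")):
    """Return the minimum number of edges from start to goal, or None if unreachable."""
    adjacency = build_adjacency(edges, set(kinds))
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def make_question(edges, start, goal):
    """Pair a templated question with an answer derived from the graph."""
    text = (
        f"How many hops separate {start} from {goal} "
        "when following FK and lineage edges?"
    )
    return {"question": text, "answer": shortest_hops(edges, start, goal)}


if __name__ == "__main__":
    print(make_question(EDGES, "raw_orders", "dim_category"))
```

Because the answer comes from a breadth-first search over the same graph that defines the question, it can be checked mechanically. In this toy schema, the path from `raw_orders` to `dim_category` exists only when both edge types are traversed, which is the kind of compositional case a question spanning FK and lineage edges would exercise.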