Beyond Case Law: Evaluating Structure-Aware Retrieval and Safety in Statute-Centric Legal QA
This paper addresses a retrieval gap in legal QA for regulatory settings, though the contribution is incremental because the benchmark is instantiated on a single domain (fire-safety regulations).
The paper tackles statute-centric legal QA, where conventional retrievers fail because relevant evidence is spread across hierarchically linked documents. It introduces SearchFireSafety, a benchmark that evaluates both hierarchical retrieval and safety. Results show that graph-guided retrieval improves performance, but they also reveal that domain-adapted models are more likely to hallucinate when evidence is missing.
Legal QA benchmarks have predominantly focused on case law, overlooking the unique challenges of statute-centric regulatory reasoning. In statutory domains, relevant evidence is distributed across hierarchically linked documents, creating a statutory retrieval gap where conventional retrievers fail and models often hallucinate under incomplete context. We introduce SearchFireSafety, a structure- and safety-aware benchmark for statute-centric legal QA. Instantiated on fire-safety regulations as a representative case, the benchmark evaluates whether models can retrieve hierarchically fragmented evidence and safely abstain when statutory context is insufficient. SearchFireSafety adopts a dual-source evaluation framework combining real-world questions that require citation-aware retrieval and synthetic partial-context scenarios that stress-test hallucination and refusal behavior. Experiments across multiple large language models show that graph-guided retrieval substantially improves performance, but also reveal a critical safety trade-off: domain-adapted models are more likely to hallucinate when key statutory evidence is missing. Our findings highlight the need for benchmarks that jointly evaluate hierarchical retrieval and model safety in statute-centric regulatory settings.
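To make the two evaluated behaviors concrete, the following is a minimal sketch of graph-guided retrieval over hierarchically linked provisions, plus a simple sufficiency check that could drive abstention. It is an illustration under stated assumptions, not the benchmark's actual pipeline: the citation graph `CITES`, the provision names, and the helpers `expand_with_graph` and `should_abstain` are all hypothetical.

```python
from collections import deque

# Hypothetical citation graph: each provision maps to the provisions it references.
# Real statutes would be parsed into such a structure; these names are invented.
CITES = {
    "Act Art. 10": ["Decree Art. 5"],
    "Decree Art. 5": ["Rule Art. 2", "Standard 3.1"],
    "Rule Art. 2": [],
    "Standard 3.1": [],
    "Act Art. 20": [],
}


def expand_with_graph(seeds, cites, max_hops=2):
    """Return the seed provisions plus everything reachable within max_hops links.

    `seeds` stands in for the top hits of a conventional retriever (assumed).
    """
    ordered = list(dict.fromkeys(seeds))  # deduplicate while keeping rank order
    visited = set(ordered)
    queue = deque((s, 0) for s in ordered)
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not follow links beyond the hop budget
        for nxt in cites.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                ordered.append(nxt)
                queue.append((nxt, depth + 1))
    return ordered


def should_abstain(retrieved, required):
    """Flag a question as unanswerable when any required provision is missing."""
    return not set(required) <= set(retrieved)


if __name__ == "__main__":
    hits = expand_with_graph(["Act Art. 10"], CITES, max_hops=2)
    print(hits)
    print(should_abstain(["Act Art. 10"], ["Act Art. 10", "Decree Art. 5"]))
```

In this sketch, a flat retriever that returns only "Act Art. 10" misses the delegated decree and rule, while the breadth-first expansion recovers them. The partial-context case corresponds to `should_abstain` returning `True`, where a safe model would be expected to refuse rather than answer.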