CL | Jan 23

Frame-Guided Synthetic Claim Generation for Automatic Fact-Checking Using High-Volume Tabular Data

arXiv:2601.17232v1 | 1 citation | h-index: 4
Originality: Incremental advance
AI Analysis

The paper addresses a critical gap in automated fact-checking benchmarks (verification against real-world, high-volume structured data) and provides a new resource for this unsolved problem.

The authors address automated fact-checking against high-volume tabular data by introducing a large-scale multilingual dataset of 78,503 synthetic claims grounded in 434 complex OECD tables, which average over 500K rows each. Their knowledge-probing experiments indicate that LLMs have not memorized these facts, so systems must perform genuine retrieval and reasoning.

Automated fact-checking benchmarks have largely ignored the challenge of verifying claims against real-world, high-volume structured data, instead focusing on small, curated tables. We introduce a new large-scale, multilingual dataset to address this critical gap. It contains 78,503 synthetic claims grounded in 434 complex OECD tables, which average over 500K rows each. We propose a novel, frame-guided methodology where algorithms programmatically select significant data points based on six semantic frames to generate realistic claims in English, Chinese, Spanish, and Hindi. Crucially, we demonstrate through knowledge-probing experiments that LLMs have not memorized these facts, forcing systems to perform genuine retrieval and reasoning rather than relying on parameterized knowledge. We provide a baseline SQL-generation system and show that our benchmark is highly challenging. Our analysis identifies evidence retrieval as the primary bottleneck, with models struggling to find the correct data in massive tables. This dataset provides a critical new resource for advancing research on this unsolved, real-world problem.
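
The generate-then-verify loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not name the six frames, so the "superlative" frame below is an assumed example, and the table name, columns, and values are invented. The sketch selects a salient data point, fills a claim template, and then checks the claim by re-querying the table with SQL.

```python
import sqlite3

# Toy stand-in for one table; names and values are invented for illustration.
ROWS = [
    ("AAA", 2020, 4.1),
    ("BBB", 2020, 7.3),
    ("CCC", 2020, 5.6),
]

def build_table(rows):
    """Load the toy rows into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE indicator (country TEXT, year INTEGER, value REAL)")
    conn.executemany("INSERT INTO indicator VALUES (?, ?, ?)", rows)
    return conn

def generate_superlative_claim(conn, year):
    """Hypothetical 'superlative' frame: pick the row with the highest value."""
    country, value = conn.execute(
        "SELECT country, value FROM indicator "
        "WHERE year = ? ORDER BY value DESC LIMIT 1",
        (year,),
    ).fetchone()
    text = f"In {year}, {country} had the highest value ({value})."
    return {"text": text, "country": country, "year": year}

def verify_superlative_claim(conn, claim):
    """Check the claim by re-querying the table, as a SQL-based verifier might."""
    row = conn.execute(
        "SELECT country FROM indicator "
        "WHERE year = ? ORDER BY value DESC LIMIT 1",
        (claim["year"],),
    ).fetchone()
    # No rows for that year means the claim cannot be supported.
    return row is not None and row[0] == claim["country"]

conn = build_table(ROWS)
claim = generate_superlative_claim(conn, 2020)
print(claim["text"])
print(verify_superlative_claim(conn, claim))
```

In a table with hundreds of thousands of rows, the hard part is producing the right `WHERE` and `ORDER BY` clauses from the claim text alone, which is consistent with the abstract's finding that evidence retrieval is the primary bottleneck.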

