TabReX: Tabular Referenceless eXplainable Evaluation
TabReX addresses the challenge of trustworthy, explainable evaluation for structured generation systems, particularly those producing tabular data, by providing a human-aligned metric that generalizes across domains without fixed references.
The authors tackled the problem of evaluating tables generated by large language models by proposing TabReX, a reference-less framework that uses graph-based reasoning to assess structural and factual fidelity, achieving the highest correlation with expert rankings and enabling fine-grained analysis.
Evaluating the quality of tables generated by large language models (LLMs) remains an open challenge: existing metrics either flatten tables into text, ignoring structure, or rely on fixed references that limit generalization. We present TabReX, a reference-less, property-driven framework for evaluating tabular generation via graph-based reasoning. TabReX converts both source text and generated tables into canonical knowledge graphs, aligns them through an LLM-guided matching process, and computes interpretable, rubric-aware scores that quantify structural and factual fidelity. The resulting metric provides controllable trade-offs between sensitivity and specificity, yielding human-aligned judgments and cell-level error traces. To systematically assess metric robustness, we introduce TabReX-Bench, a large-scale benchmark spanning six domains and twelve planner-driven perturbation types across three difficulty tiers. Empirical results show that TabReX achieves the highest correlation with expert rankings, remains stable under harder perturbations, and enables fine-grained model-vs-prompt analysis, establishing a new paradigm for trustworthy, explainable evaluation of structured generation systems.
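To make the graph-then-align-then-score idea concrete, the following is a minimal Python sketch, not the TabReX implementation. It makes several assumptions that the abstract does not state: graphs are represented as flat (subject, predicate, object) triples, the source-side triples are written by hand rather than extracted from text, the LLM-guided matching is replaced by exact matching on normalized strings, and the two scores are simple precision-like and recall-like ratios rather than the rubric-aware scores described above. All names (`Triple`, `table_to_triples`, `align_and_score`) are hypothetical.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def norm(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def table_to_triples(header: list[str], rows: list[list[str]]) -> list[Triple]:
    """Treat the first column as the row key; each other cell becomes a triple."""
    triples = []
    for row in rows:
        key = row[0]
        for column, cell in zip(header[1:], row[1:]):
            triples.append(Triple(norm(key), norm(column), norm(cell)))
    return triples


def align_and_score(source: list[Triple], table: list[Triple]):
    """Exact-match alignment (a stand-in for LLM-guided matching).

    Returns a precision-like score, a recall-like score, and a list of
    cell-level error traces: (cell key, error kind, table value, source value).
    """
    src = {(t.subject, t.predicate): t.obj for t in source}
    tab = {(t.subject, t.predicate): t.obj for t in table}
    errors: list[tuple[tuple[str, str], str, Optional[str], Optional[str]]] = []
    correct = 0
    for key, value in tab.items():
        if key not in src:
            errors.append((key, "unsupported", value, None))
        elif src[key] != value:
            errors.append((key, "mismatch", value, src[key]))
        else:
            correct += 1
    for key, value in src.items():
        if key not in tab:
            errors.append((key, "missing", None, value))
    precision = correct / len(tab) if tab else 0.0
    recall = correct / len(src) if src else 0.0
    return precision, recall, errors


# Hand-written source facts (assumed; a real system would extract these from text).
source = [
    Triple("alice", "age", "30"),
    Triple("alice", "city", "paris"),
    Triple("bob", "age", "25"),
]
header = ["Name", "Age", "City"]
rows = [["Alice", "30", "Paris"], ["Bob", "26", "Rome"]]

precision, recall, errors = align_and_score(source, table_to_triples(header, rows))
for error in errors:
    print(error)
```

In this toy example the table gets Bob's age wrong and adds a city for Bob that the source never mentions, so the error list points at exactly those two cells. Keeping the precision-like and recall-like scores separate is one simple way to expose a trade-off between penalizing unsupported cells and penalizing omitted facts.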