DBAIJan 22

GLEAN: Grounded Lightweight Evaluation Anchors for Contamination-Aware Tabular Reasoning

arXiv:2603.02212v11 citationsh-index: 1
Originality Incremental advance
AI Analysis

This work addresses contamination and diagnostic issues in tabular reasoning evaluations for small models, though it is incremental as it builds on existing benchmarks and methods.

The paper tackles the problem of evaluating small models on tabular reasoning tasks, which are vulnerable to contamination and dataset artifacts, by introducing GLEAN, a lightweight evaluation protocol that integrates contamination-aware probes and structured error attribution, revealing stable error-mode separations between models like TAPEX and TAPAS.

Tabular reasoning benchmarks mix semantic inference, numerical computation, and brittle table formatting, yet evaluations for small models remain vulnerable to contamination, dataset artifacts, and retrieval failures. We propose GLEAN, a lightweight evaluation protocol that integrates contamination-aware probes, weak-supervision governance, retrieval-reasoning diagnostics, and structured error attribution under tight hardware constraints. We evaluate across TabFact, WTQ via Squall, TableBench, RobuT, and SciTab under a 16GB GPU budget. Using Squall gold SQL as an executable anchor (95.2% execution), GLEAN assigns a deterministic error taxonomy (L0-L4 plus L0.5 context miss) and reveals a stable error-mode separation: TAPEX errors skew toward grounding (L3) while TAPAS errors skew toward hallucination/abstention (L2/L0). We validate evidence-row heuristics against SQL-derived rows on simple queries (0.62 precision / 0.71 recall; hybrid recall 0.81) and show that retrieval Recall@K can saturate even when end-to-end EM/F1 remains limited, motivating attribution beyond raw recall. We release a modular framework with audits and sensitivity checks to make small-model tabular evaluation more contamination-aware and diagnostic.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes