CL · Jun 16, 2025

Document-Level Tabular Numerical Cross-Checking: A Coarse-to-Fine Approach

Chaoxu Pang, Yixuan Cao, Ganbin Zhou, Hongwei Li, Ping Luo

arXiv:2506.13328v1 · 1 citation · h-index: 12

Originality: Highly original

AI Analysis

This work addresses automated numerical cross-checking in disclosure documents, a task that underpins their accuracy and credibility, and proposes a novel method for a known bottleneck.

The paper tackles the problem of automatically checking numerical consistency across tables in disclosure documents. It introduces CoFiTCheck, a coarse-to-fine LLM-based framework that addresses the combinatorial explosion of candidate pairs and the difficulty of comprehending numerical semantics, achieving significant performance improvements over previous methods while maintaining practical efficiency.

Numerical consistency across tables in disclosure documents is critical for ensuring accuracy, maintaining credibility, and avoiding reputational and economic risks. Automated tabular numerical cross-checking presents two significant challenges: (C1) managing the combinatorial explosion of candidate instances at the document level and (C2) comprehending multi-faceted numerical semantics. Previous research typically depends on heuristic-based filtering or simplified context extraction, often struggling to balance performance and efficiency. Recently, large language models (LLMs) have demonstrated remarkable contextual understanding capabilities that help address C2 at the instance level, yet they remain hampered by computational inefficiency (C1) and limited domain expertise. This paper introduces CoFiTCheck, a novel LLM-based coarse-to-fine framework that addresses these challenges through two sequential stages: embedding-based filtering and discriminative classification. The embedding-based filtering stage introduces an instructional parallel encoding method to efficiently represent all numerical mentions in a table with LLMs, as well as a decoupled InfoNCE objective to mitigate the isolated mention problem. The discriminative classification stage employs a specialized LLM for fine-grained analysis of the remaining candidate pairs. This stage is further enhanced by our cross-table numerical alignment pretraining paradigm, which leverages weak supervision from cross-table numerical equality relationships to enrich task-specific priors without requiring manual annotation. Comprehensive evaluation across three types of real-world disclosure documents demonstrates that CoFiTCheck significantly outperforms previous methods while maintaining practical efficiency.
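The two-stage idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy embedding vectors, the 0.9 threshold, the mention identifiers, and the `toy_classifier` stand-in (which replaces the specialized LLM with a simple name comparison) are all assumptions made for illustration. It only shows how a cheap similarity filter can shrink the set of candidate pairs before a more expensive per-pair check runs.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def coarse_filter(embeddings, threshold):
    """Stage 1 (coarse): keep mention pairs whose embeddings are similar."""
    return [
        (i, j)
        for i, j in combinations(sorted(embeddings), 2)
        if cosine(embeddings[i], embeddings[j]) >= threshold
    ]


def fine_classify(pairs, classifier):
    """Stage 2 (fine): run the expensive classifier on surviving pairs only."""
    return [pair for pair in pairs if classifier(*pair)]


def toy_classifier(a, b):
    """Hypothetical stand-in for the LLM: same item name after the table prefix."""
    return a.split("_", 1)[1] == b.split("_", 1)[1]


# Toy embeddings for four numerical mentions (hypothetical values).
embeddings = {
    "t1_revenue": [1.0, 0.0, 0.1],
    "t2_revenue": [0.9, 0.1, 0.1],
    "t1_cost": [0.0, 1.0, 0.0],
    "t3_gross": [0.8, 0.3, 0.0],
}

candidates = coarse_filter(embeddings, threshold=0.9)
matched = fine_classify(candidates, toy_classifier)

print(f"{len(candidates)} of 6 possible pairs reach the fine stage")
print("pairs judged equal:", matched)
```

In this toy setting the filter discards every pair involving `t1_cost`, so the classifier is called on three pairs instead of six. With thousands of numerical mentions per document the number of possible pairs grows quadratically, which is why a cheap first stage matters.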
