Segment-driven Structural Induction and Semantic Alignment for Heterogeneous Tabular Representation
For practitioners working with heterogeneous tables, NAVI offers a pretraining framework that better handles varying headers and shared semantics, though improvements are shown only on in-domain data.
NAVI improves heterogeneous tabular representation by treating each header-value pair as the unit for aggregating schema-level structural evidence and column-level distributional evidence, achieving better reconstruction, semantic consistency, and downstream utility on in-domain tables.
Real-world domains often contain heterogeneous tables whose headers vary while their underlying attribute semantics are shared, making it difficult to induce domain-specialized semantics from table-local evidence alone. Existing encoders model parts of this problem, but often underuse column-level value distributions and apply uniform objectives across attributes with different semantic roles. We propose NAVI, a segment-centric pretraining framework that treats each header-value pair as the unit for aggregating schema-level structural evidence and column-level distributional evidence. We realize this design through Masked Segment Modeling and Entropy-driven Segment Alignment, which jointly enforce structured header-value coupling and semantic alignment across stable and instance-specific attributes. Experiments on heterogeneous in-domain tables show improved reconstruction, semantic consistency, and downstream utility across evaluation settings.
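To make the notion of column-level distributional evidence concrete, the following is a minimal illustrative sketch, not NAVI's actual implementation. It assumes that a column's empirical Shannon entropy can serve as a rough signal for separating stable attributes (low entropy) from instance-specific ones (high entropy). The function names `column_entropy` and `split_segments_by_entropy`, the `threshold` value, and the toy rows are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def column_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of a column's values."""
    counts = Counter(values)
    total = sum(counts.values())
    # Sum of p * log2(1 / p) over the observed values.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def split_segments_by_entropy(rows, threshold=1.0):
    """Group header-value segments by column entropy.

    `rows` is a list of dicts mapping header -> value. The threshold is a
    hypothetical cut-off for this sketch, not a value taken from the paper.
    """
    columns = {}
    for row in rows:
        for header, value in row.items():
            columns.setdefault(header, []).append(value)

    entropies = {h: column_entropy(v) for h, v in columns.items()}
    stable = sorted(h for h, e in entropies.items() if e <= threshold)
    instance_specific = sorted(h for h, e in entropies.items() if e > threshold)
    return entropies, stable, instance_specific


# Toy table: "unit" never changes, "serial" is unique per row.
rows = [
    {"unit": "kg", "serial": "A1"},
    {"unit": "kg", "serial": "B2"},
    {"unit": "kg", "serial": "C3"},
    {"unit": "kg", "serial": "D4"},
]
entropies, stable, instance_specific = split_segments_by_entropy(rows)
print(stable, instance_specific)
```

In this toy table, the constant `unit` column falls on the stable side of the cut-off and the unique-per-row `serial` column on the instance-specific side. How NAVI actually uses entropy inside its alignment objective is not specified in this abstract, so the hard threshold here is only a stand-in.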