RACT: Retrieval Augmented Column-Table Learning and Prediction for Multi-Table Schema Matching
For data integration practitioners, this work provides a method for handling heterogeneous schema designs in which columns with similar meaning reside in different table contexts.
Exploiting referential context through a self-supervised retrieval framework improves schema matching for multi-table schemas, raising matching precision and completeness by up to 70% over similarity-based baselines.
Schema matching, a critical task for integrating data from diverse sources, seeks to identify correspondences between columns across different schemas. In multi-table holistic schema matching, heterogeneous schema designs mean that columns with similar semantic meaning may reside in tables with different contexts, a setting in which similarity-based techniques are inadequate. This paper focuses on incorporating referential context into schema matching by introducing RACT (Retrieval Augmented Column-Table) learning and prediction, a self-supervised framework that probabilistically retrieves candidate tables for each source column in order to constrain the set of relevant column candidates. Experiments demonstrate that this approach outperforms similarity-based baselines on matching multi-table schemas. In subsequent matching experiments, constraining the column search space to the top-t retrieved tables improves both average matching precision and completeness by up to 70%.
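The idea of constraining column candidates to the top-t retrieved tables can be illustrated with a minimal sketch. This is not the RACT model: the learned, self-supervised probabilistic retrieval is replaced here by a simple token-overlap cosine score, and the function names (`top_t_tables`, `match_column`), the toy schema, and the scoring are all assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt


def tokens(name):
    # Split an identifier such as "customer_id" into lowercase tokens.
    return name.lower().split("_")


def cosine(a, b):
    # Cosine similarity between two bags of tokens.
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def top_t_tables(source_table, source_column, target_schema, t):
    # ASSUMPTION: a stand-in retrieval score. The source column's context
    # (table name + column name) is compared with each target table's
    # context (table name + all of its column names).
    query = tokens(source_table) + tokens(source_column)
    scored = []
    for table, columns in target_schema.items():
        context = tokens(table) + [tok for c in columns for tok in tokens(c)]
        scored.append((cosine(query, context), table))
    # Highest score first; ties broken alphabetically by table name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [table for _, table in scored[:t]]


def match_column(source_table, source_column, target_schema, t):
    # Only columns of the top-t retrieved tables are considered as candidates.
    best = None
    for table in top_t_tables(source_table, source_column, target_schema, t):
        for column in target_schema[table]:
            score = cosine(tokens(source_column), tokens(column))
            if best is None or score > best[0]:
                best = (score, table, column)
    return best


# Hypothetical target schema: "name" appears in two tables with different contexts.
target_schema = {
    "customer": ["id", "name", "email"],
    "product": ["id", "name", "price"],
    "orders": ["id", "customer_id", "product_id"],
}

retrieved = top_t_tables("product_catalog", "name", target_schema, t=1)
result = match_column("product_catalog", "name", target_schema, t=1)
```

In this toy schema, name similarity alone cannot distinguish `customer.name` from `product.name`, since both match the source column `name` equally well. Retrieving the table first uses the source table's context to narrow the candidates to `product`, which is the kind of ambiguity that table-level retrieval is intended to resolve.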