CL, AI, DB, IR | Jun 20, 2023

Retrieval-Based Transformer for Table Augmentation

arXiv:2306.11843v1225 citations | h-index: 30
Originality: Incremental advance
AI Analysis

This paper addresses the time-consuming data preparation work of data analysts by automating table augmentation from data lakes. The advance appears incremental, since it builds on existing transformer-based models and adds a retrieval and self-training strategy.

The paper tackles automatic data wrangling for table augmentation tasks such as row/column population and data imputation. It introduces a retrieval-augmented, self-trained transformer model that consistently and substantially outperforms existing methods on the EntiTables and WebTables benchmarks.

Data preparation, also called data wrangling, is considered one of the most expensive and time-consuming steps when performing analytics or building machine learning models. Preparing data typically involves collecting and merging data from complex, heterogeneous, and often large-scale data sources, such as data lakes. In this paper, we introduce a novel approach toward automatic data wrangling in an attempt to alleviate the effort of end-users, e.g., data analysts, in structuring dynamic views from data lakes in the form of tabular data. We aim to address table augmentation tasks, including row/column population and data imputation. Given a corpus of tables, we propose a retrieval-augmented, self-trained transformer model. Our self-learning strategy consists in randomly ablating tables from the corpus and training the retrieval-based model to reconstruct the original values or headers given the partial tables as input. We adopt this strategy to first train the dense neural retrieval model, which encodes table parts to vectors, and then the end-to-end model, which is trained to perform table augmentation tasks. We test on EntiTables, the standard benchmark for table augmentation, and also introduce a new benchmark to advance further research: WebTables. Our model consistently and substantially outperforms both supervised statistical methods and the current state-of-the-art transformer-based models.
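The self-learning strategy described in the abstract (randomly ablating a table and asking the model to reconstruct the removed value or header) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Table` format, the `[MASK]` placeholder, and the function names `ablate_cell` and `ablate_header` are all hypothetical, chosen only to show how one table could yield (partial table, target) training pairs.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical in-memory table format: header -> list of cell values.
Table = Dict[str, List[str]]

MASK = "[MASK]"  # assumed placeholder token, not taken from the paper


def ablate_cell(table: Table, rng: random.Random) -> Tuple[Table, Tuple[str, int], str]:
    """Hide one randomly chosen cell (a data-imputation style example).

    Returns the partial table, the (header, row index) of the hidden cell,
    and the original value that a model would be trained to reconstruct.
    """
    header = rng.choice(sorted(table))
    row = rng.randrange(len(table[header]))
    # Copy every column so the source table is left untouched.
    partial = {h: list(col) for h, col in table.items()}
    target = partial[header][row]
    partial[header][row] = MASK
    return partial, (header, row), target


def ablate_header(table: Table, rng: random.Random) -> Tuple[Table, str]:
    """Drop one randomly chosen column (a column-population style example).

    Returns the partial table and the removed header, which serves as the
    reconstruction target.
    """
    header = rng.choice(sorted(table))
    partial = {h: list(col) for h, col in table.items() if h != header}
    return partial, header
```

Calling `ablate_cell(table, random.Random(0))` on a small table such as `{"country": ["France", "Japan"], "capital": ["Paris", "Tokyo"]}` returns a copy with exactly one cell masked, plus the value to predict. In the setup the abstract describes, such pairs would first train the dense retriever and then the end-to-end model.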

Code Implementations: 1 repo
