Towards Benchmarking Foundation Models for Tabular Data With Text
This work addresses a benchmarking gap for researchers and practitioners working on multimodal tabular data, although the contribution is incremental because it builds on existing tabular foundation models.
The paper tackles the lack of benchmarks for tabular foundation models that include textual data. It proposes strategies for incorporating text into tabular pipelines and curates real-world datasets with meaningful text features, then uses them in a benchmarking study that evaluates state-of-the-art models.
Foundation models for tabular data are rapidly evolving, with increasing interest in extending them to support additional modalities such as free-text features. However, existing benchmarks for tabular data rarely include textual columns, and identifying real-world tabular datasets with semantically rich text features is non-trivial. We propose a series of simple yet effective ablation-style strategies for incorporating text into conventional tabular pipelines. Moreover, we benchmark how state-of-the-art tabular foundation models can handle textual data by manually curating a collection of real-world tabular datasets with meaningful textual features. Our study is an important step towards improving benchmarking of foundation models for tabular data with text.
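The abstract does not spell out the individual strategies, so the following is only an illustrative sketch of what incorporating a free-text column into a conventional tabular pipeline could look like. It uses the hashing trick (a stand-in chosen here, not necessarily one of the paper's strategies) to turn each text cell into a fixed number of numeric columns that any tabular model can consume. The function names, the `review` column, and the bucket count are all hypothetical.

```python
import hashlib


def hash_text_features(text, n_buckets=8):
    """Map free text to a fixed-length token-count vector (hashing trick)."""
    vec = [0.0] * n_buckets
    for token in text.lower().split():
        # md5 is stable across runs, unlike Python's salted built-in hash()
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % n_buckets] += 1.0
    return vec


def add_text_columns(rows, text_key, n_buckets=8):
    """Replace one text column with n_buckets numeric columns in every row."""
    out = []
    for row in rows:
        # Keep all non-text features unchanged
        new_row = {k: v for k, v in row.items() if k != text_key}
        hashed = hash_text_features(row[text_key], n_buckets)
        for i, value in enumerate(hashed):
            new_row[f"{text_key}_h{i}"] = value
        out.append(new_row)
    return out


# Hypothetical toy table: one numeric feature and one free-text feature
rows = [
    {"price": 12.5, "review": "great product fast shipping"},
    {"price": 40.0, "review": "broken on arrival"},
]
table = add_text_columns(rows, "review", n_buckets=8)
```

Each row of `table` now holds only numeric values, so it can be passed to a model that expects purely numerical input. Hashing discards word order and semantics; a learned text embedding would preserve more meaning at a higher computational cost.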