CL · May 23, 2023

Schema-Driven Information Extraction from Heterogeneous Tables

arXiv:2305.14336v5 · 32 citations · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the challenge of extracting structured information from diverse tables, offering researchers and practitioners in fields such as machine learning, chemistry, and material science a practical and efficient solution.

The paper tackles cost-efficient information extraction from heterogeneous tables by introducing schema-driven information extraction, in which large language models transform tabular data into structured records that follow a human-authored schema. The evaluated models reach F1 scores from 74.2 to 96.1 without task-specific pipelines or labels.

In this paper, we explore the question of whether large language models can support cost-efficient information extraction from tables. We introduce schema-driven information extraction, a new task that transforms tabular data into structured records following a human-authored schema. To assess various LLMs' capabilities on this task, we present a benchmark composed of tables from four diverse domains: machine learning papers, chemistry literature, material science journals, and webpages. We use this collection of annotated tables to evaluate the ability of open-source and API-based language models to extract information from tables covering diverse domains and data formats. Our experiments demonstrate that surprisingly competitive performance can be achieved without requiring task-specific pipelines or labels, achieving F1 scores ranging from 74.2 to 96.1, while maintaining cost efficiency. Moreover, through detailed ablation studies and analyses, we investigate the factors contributing to model success and validate the practicality of distilling compact models to reduce API reliance.
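
To make the task concrete, the following is a minimal sketch of what "structured records following a human-authored schema" and an F1 comparison against annotated records could look like. It is not the paper's evaluation code: the schema attributes, the helper names (conform, to_triples, f1_score), the alignment of records by list position, and the use of exact string matching are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

Record = Dict[str, str]

# Hypothetical schema for a results table in a machine learning paper.
# The attribute names are illustrative and not taken from the paper.
SCHEMA: Set[str] = {"task", "dataset", "metric", "value"}


def conform(record: Record, schema: Set[str]) -> Record:
    """Keep only attributes named in the schema and drop empty values."""
    return {k: v.strip() for k, v in record.items() if k in schema and v.strip()}


def to_triples(records: List[Record], schema: Set[str]) -> Set[Tuple[int, str, str]]:
    """Flatten records into (record index, attribute, value) triples.

    Assumption: predicted and gold records are aligned by list position.
    """
    return {
        (i, attr, value)
        for i, record in enumerate(records)
        for attr, value in conform(record, schema).items()
    }


def f1_score(predicted: List[Record], gold: List[Record],
             schema: Set[str] = SCHEMA) -> float:
    """Exact-match F1 over attribute-value triples (a simplifying assumption)."""
    pred = to_triples(predicted, schema)
    ref = to_triples(gold, schema)
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


gold_records = [{"task": "QA", "dataset": "SQuAD", "metric": "F1", "value": "88.5"}]
predicted_records = [
    # One wrong attribute ("metric") and one attribute outside the schema ("note").
    {"task": "QA", "dataset": "SQuAD", "metric": "EM", "value": "88.5", "note": "extra"}
]
score = f1_score(predicted_records, gold_records)
```

In this toy case three of the four schema attributes match, so precision and recall are both 0.75. A realistic evaluation would likely need more tolerant value matching and a way to align records that does not depend on their order.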

Code Implementations: 1 repo
