CL | AI | DB | IR | LG | Feb 22, 2024

IEPile: Unearthing Large-Scale Schema-Based Information Extraction Corpus

arXiv:2402.14710v3 | 20 citations | h-index: 32 | Has Code
Originality: Synthesis-oriented
AI Analysis

This paper addresses the problem of limited and fragmented datasets for Information Extraction by providing a standardized resource for the NLP community. The contribution is incremental, since the corpus is built from existing datasets.

The authors tackled the performance gap of Large Language Models in Information Extraction by constructing IEPile, a large-scale bilingual instruction corpus of about 0.32B tokens, which improved LLMs' performance with notable gains in zero-shot generalization.

Large Language Models (LLMs) demonstrate remarkable potential across various domains; however, they exhibit a significant performance gap in Information Extraction (IE). High-quality instruction data is key to enhancing the specific capabilities of LLMs, yet current IE datasets tend to be small in scale, fragmented, and lacking a standardized schema. To this end, we introduce IEPile, a comprehensive bilingual (English and Chinese) IE instruction corpus, which contains approximately 0.32B tokens. We construct IEPile by collecting and cleaning 33 existing IE datasets, and introduce schema-based instruction generation to unearth a large-scale corpus. Experimentally, IEPile enhances the performance of LLMs for IE, with notable improvements in zero-shot generalization. We open-source the resource and pre-trained models, hoping to provide valuable support to the NLP community.
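
The abstract names schema-based instruction generation without describing its format. As a rough illustration only, the sketch below shows one way an annotated IE example could be turned into an instruction record whose prompt carries an explicit schema. The function name `build_instruction`, the field names, and the prompt wording are all hypothetical and are not taken from the IEPile paper or its repository.

```python
import json
from typing import Dict, List


def build_instruction(
    text: str, schema: List[str], entities: Dict[str, List[str]]
) -> Dict[str, str]:
    """Pair a schema-constrained prompt with its expected structured answer.

    Hypothetical record layout, assumed for illustration only.
    """
    prompt = {
        "instruction": "Extract entities of the listed types from the input.",
        "schema": schema,
        "input": text,
    }
    # Every schema label appears in the answer; labels with no match get an empty list.
    answer = {label: entities.get(label, []) for label in schema}
    return {
        "instruction": json.dumps(prompt, ensure_ascii=False),
        "output": json.dumps(answer, ensure_ascii=False),
    }


record = build_instruction(
    text="Marie Curie was born in Warsaw.",
    schema=["person", "location", "organization"],
    entities={"person": ["Marie Curie"], "location": ["Warsaw"]},
)
```

Listing the schema in the prompt and requiring an answer slot for every label is one plausible way to standardize heterogeneous datasets under a single output format; the corpus itself may use a different design.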

Code Implementations: 1 repo
