CL

Jul 8, 2025

DocIE@XLLM25: In-Context Learning for Information Extraction using Fully Synthetic Demonstrations

Nicholas Popovič, Ashish Kangen, Tim Schopf, Michael Färber

arXiv:2507.05997v1

6.71 citations

h-index: 8

Proceedings of the 1st Joint Workshop on Large Language Models and Structure Modeling (XLLM 2025)

Originality: Incremental advance

AI Analysis

The paper addresses the shortage of annotated data facing researchers and practitioners in document-level information extraction. The advance is incremental because it builds on existing in-context learning methods.

The paper tackles the scarcity of annotated corpora for document-level entity and relation extraction by developing a fully automatic, LLM-based pipeline for synthetic data generation and in-context learning. The pipeline produces a dataset of over 5k Wikipedia abstracts with approximately 59k entities and 30k relation triples. The authors find that in-context joint extraction remains challenging for state-of-the-art models.

Large, high-quality annotated corpora remain scarce in document-level entity and relation extraction in zero-shot or few-shot settings. In this paper, we present a fully automatic, LLM-based pipeline for synthetic data generation and in-context learning for document-level entity and relation extraction. In contrast to existing approaches that rely on manually annotated demonstrations or direct zero-shot inference, our method combines synthetic data generation with retrieval-based in-context learning, using a reasoning-optimized language model. This allows us to build a high-quality demonstration database without manual annotation and to dynamically retrieve relevant examples at inference time. Based on our approach, we produce a synthetic dataset of over $5k$ Wikipedia abstracts with approximately $59k$ entities and $30k$ relation triples. Finally, we evaluate in-context learning performance on the DocIE shared task, extracting entities and relations from long documents in a zero-shot setting. We find that in-context joint entity and relation extraction at the document level remains a challenging task, even for state-of-the-art large language models.
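The retrieval-based in-context learning step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which retriever or prompt format is used, so the bag-of-words cosine similarity, the prompt layout, the function names (`retrieve`, `build_prompt`), and the three toy demonstrations with their relation labels are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, demos, k=2):
    """Return the k demonstrations most similar to the query document.

    Assumption: a toy lexical retriever stands in for whatever retriever
    the paper actually uses.
    """
    q = Counter(tokenize(query))
    return sorted(
        demos,
        key=lambda d: cosine(q, Counter(tokenize(d["text"]))),
        reverse=True,
    )[:k]


def build_prompt(query, demos, k=2):
    """Assemble a few-shot prompt from retrieved demonstrations (hypothetical format)."""
    parts = []
    for d in retrieve(query, demos, k):
        triples = "; ".join(f"({h}, {r}, {t})" for h, r, t in d["triples"])
        parts.append(f"Document: {d['text']}\nTriples: {triples}")
    parts.append(f"Document: {query}\nTriples:")
    return "\n\n".join(parts)


# Made-up stand-ins for entries in a synthetic demonstration database.
DEMOS = [
    {"text": "Marie Curie was a physicist born in Warsaw.",
     "triples": [("Marie Curie", "place_of_birth", "Warsaw")]},
    {"text": "The Danube is a river that flows through Vienna.",
     "triples": [("Danube", "flows_through", "Vienna")]},
    {"text": "Alan Turing was a mathematician born in London.",
     "triples": [("Alan Turing", "place_of_birth", "London")]},
]

if __name__ == "__main__":
    print(build_prompt("Ada Lovelace was a mathematician born in London.", DEMOS))
```

In a realistic setting the lexical similarity would likely be replaced by dense embeddings, and the assembled prompt would be sent to a language model; both are outside the scope of this sketch.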
