CL · Mar 27, 2023

Unified Text Structuralization with Instruction-tuned Language Models

arXiv:2303.14956v23.615 citations · h-index: 29

Originality: Incremental advance

AI Analysis

This work addresses the shortage of annotated datasets and lack of generalization in information extraction, particularly benefiting low-resource and domain-specific fields like finance and law.

The authors tackled the problem of text structuralization by proposing an instruction-tuned LLM approach to extract various structures from texts, achieving performance comparable to state-of-the-art methods across multiple languages and domains.

Text structuralization is an important field of natural language processing (NLP) that consists of information extraction (IE) and structure formalization. However, current studies of text structuralization suffer from a shortage of manually annotated, high-quality datasets across domains and languages, because building such datasets requires specialized professional knowledge. In addition, most IE methods are designed for a specific type of structured data, e.g., entities, relations, or events, making them hard to generalize to other types. In this work, we propose a simple and efficient approach that instructs a large language model (LLM) to extract a variety of structures from texts. More concretely, we add a prefix and a suffix instruction to indicate the desired IE task and structure type, respectively, before feeding the text into an LLM. Experiments on two LLMs show that this approach enables language models to perform comparably to other state-of-the-art methods on datasets covering a variety of languages and domains of knowledge, and to generalize to other IE sub-tasks by changing the content of the instruction. Another benefit of our approach is that it can help researchers build datasets in low-resource and domain-specific scenarios, e.g., finance and law, at low cost.
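The prefix/suffix scheme described in the abstract can be illustrated with a minimal Python sketch. The function name `build_prompt`, the instruction wording, and the example sentence are all hypothetical and are not taken from the paper; the sketch only shows how a task instruction placed before the text and a structure-type instruction placed after it could be assembled, and how switching IE sub-tasks amounts to changing those two strings.

```python
def build_prompt(text: str, task: str, structure: str) -> str:
    """Wrap `text` with a prefix (IE task) and a suffix (structure type).

    The instruction wording here is an assumption made for illustration,
    not the templates used by the authors.
    """
    prefix = f"Task: {task}."
    suffix = f"Output format: {structure}."
    return f"{prefix}\n{text}\n{suffix}"


# Hypothetical (task, structure) pairs for two IE sub-tasks.
NER = ("named entity recognition", "a list of (entity, type) pairs")
RE = ("relation extraction", "a list of (head, relation, tail) triples")

sentence = "Acme Corp. acquired Beta Ltd. in 2021."

# The same input text yields different prompts when only the
# instructions change; each prompt would then be sent to an LLM.
ner_prompt = build_prompt(sentence, *NER)
re_prompt = build_prompt(sentence, *RE)
```

Under these assumptions, supporting a further sub-task such as event extraction would only require another (task, structure) pair, with no change to the prompt-building code.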
