Cuckoo: An IE Free Rider Hatched by Massive Nutrition in LLM's Nest
This work addresses data scarcity in information extraction by letting IE models benefit from advances in LLM data preparation without extra manual effort, though the contribution is incremental because it builds on existing LLM paradigms.
The paper tackles the challenge of scaling information extraction (IE) models by reframing next-token prediction as extraction, which lets IE models draw on large language model (LLM) data resources. The result is Cuckoo, a model that outperforms existing pre-trained IE models in few-shot settings.
Massive high-quality data, both pre-training raw texts and post-training annotations, have been carefully prepared to incubate advanced large language models (LLMs). In contrast, for information extraction (IE), pre-training data, such as BIO-tagged sequences, are hard to scale up. We show that IE models can act as free riders on LLM resources by reframing next-token \emph{prediction} into \emph{extraction} for tokens already present in the context. Specifically, our proposed next tokens extraction (NTE) paradigm learns a versatile IE model, \emph{Cuckoo}, from 102.6M extractive data instances converted from LLMs' pre-training and post-training data. Under the few-shot setting, Cuckoo adapts effectively to both traditional and complex instruction-following IE, outperforming existing pre-trained IE models. As a free rider, Cuckoo can naturally evolve with ongoing advancements in LLM data preparation, benefiting from improvements in LLM training pipelines without additional manual effort.
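To make the reframing concrete, the following is a minimal sketch of how raw text might be converted into extractive training instances. It is an illustration under simplifying assumptions, not the conversion pipeline used for Cuckoo: it works on whitespace-split words rather than subword tokens, applies no filtering of trivial matches, and the names `find_occurrences`, `nte_instances`, and `max_span` are hypothetical. Whenever the upcoming span already occurs in the preceding context, the sketch emits that context with BIO tags marking where the span can be extracted.

```python
from typing import List, Tuple

Instance = Tuple[List[str], List[str], List[str]]  # (context, BIO tags, target span)


def find_occurrences(context: List[str], span: List[str]) -> List[int]:
    """Return every start index at which `span` occurs in `context`."""
    n = len(span)
    return [j for j in range(len(context) - n + 1) if context[j:j + n] == span]


def nte_instances(tokens: List[str], max_span: int = 3) -> List[Instance]:
    """Turn a token sequence into extraction instances (illustrative only).

    At each position, look for the longest upcoming span (up to `max_span`
    tokens) that already appears in the preceding context. If one exists,
    the "next tokens" can be extracted instead of predicted, so we emit the
    context with BIO tags over the earlier occurrences.
    """
    instances: List[Instance] = []
    i = 1
    while i < len(tokens):
        matched = False
        for n in range(min(max_span, len(tokens) - i), 0, -1):
            span = tokens[i:i + n]
            context = tokens[:i]
            starts = find_occurrences(context, span)
            if not starts:
                continue
            tags = ["O"] * len(context)
            for s in starts:
                if any(t != "O" for t in tags[s:s + n]):
                    continue  # skip occurrences overlapping an already tagged one
                tags[s] = "B"
                for k in range(s + 1, s + n):
                    tags[k] = "I"
            instances.append((context, tags, span))
            i += n  # continue after the extracted span
            matched = True
            break
        if not matched:
            i += 1  # next token is not in the context; nothing to extract here
    return instances


if __name__ == "__main__":
    text = "Tom likes cats . Who likes cats ? Tom"
    for context, tags, span in nte_instances(text.split()):
        print(" ".join(span), "<-", list(zip(context, tags)))
```

Because every instance is derived from text alone, a converter of this kind needs no manual annotation, which is what allows the extractive data to scale with whatever corpora are prepared for LLMs. A practical converter would likely also filter uninformative matches such as punctuation or function words, a step this sketch omits.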