Large Language Models as Span Annotators
Using LLMs as span annotators provides a scalable solution for researchers and practitioners who need high-quality text analysis, though the contribution is incremental: it applies existing LLMs to a new task.
The study addresses span annotation by showing that large language models (LLMs) can serve as flexible and cost-effective alternatives to human annotators, achieving inter-annotator agreement comparable to that of humans at a fraction of the cost across three diverse tasks.
Span annotation is the task of localizing and classifying text spans according to custom guidelines. Annotated spans can be used to analyze and evaluate high-quality texts for which single-score metrics fail to provide actionable feedback. Until recently, span annotation was limited to human annotators or fine-tuned models. In this study, we show that large language models (LLMs) can serve as flexible and cost-effective span annotation backbones. To demonstrate their utility, we compare LLMs to skilled human annotators on three diverse span annotation tasks: evaluating data-to-text generation, identifying translation errors, and detecting propaganda techniques. We demonstrate that LLMs achieve inter-annotator agreement (IAA) comparable to that of human annotators at a fraction of the cost per output annotation. We also manually analyze model outputs, finding that LLMs make errors at a similar rate to human annotators. We release the dataset of more than 40k model and human annotations for further research.
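To make the task definition concrete, the following is a minimal sketch of how a span annotation might be represented and how two annotators' outputs could be compared. Everything in it is an assumption for illustration: the `Span` class, the label names, the example sentence, and the character-level F1 score are hypothetical and are not taken from the study, whose actual IAA metric is not specified in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A labeled text span (hypothetical representation).

    `start` is inclusive and `end` is exclusive, both as character offsets.
    """
    start: int
    end: int
    label: str


def labeled_chars(spans):
    """Expand spans into a set of (character index, label) pairs."""
    return {(i, s.label) for s in spans for i in range(s.start, s.end)}


def char_f1(spans_a, spans_b):
    """Character-level F1 between two annotators' spans on the same text.

    Illustrative agreement measure only; not necessarily the study's IAA metric.
    """
    a, b = labeled_chars(spans_a), labeled_chars(spans_b)
    if not a and not b:
        return 1.0  # both annotators marked nothing: treat as full agreement
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(b)
    recall = overlap / len(a)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: two annotators mark errors in the same generated sentence.
text = "The tower is 300 m tall and was built in 1850."
human = [Span(13, 18, "incorrect_number"), Span(41, 45, "incorrect_date")]
model = [Span(13, 16, "incorrect_number"), Span(41, 45, "incorrect_date")]

print(text[13:18])                       # 300 m
print(round(char_f1(human, model), 3))   # 0.875
```

Scoring at the character level gives partial credit when two annotators agree on the label but disagree slightly on the span boundaries, as in the first span above. A stricter variant would require exact boundary matches.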