Refining and Reusing Annotation Guidelines for LLM Annotation
For NLP practitioners using LLMs for annotation, this work offers a method to improve zero-shot annotation accuracy under minimal supervision.
The paper proposes an iterative moderation framework that refines and reuses annotation guidelines to align LLMs with gold-standard conventions, achieving improved performance on biomedical NER tasks across three LLM families.
While Large Language Models (LLMs) demonstrate remarkable performance on zero-shot annotation tasks, they often struggle with the specialized conventions of gold-standard benchmarks. We propose the systematic reuse and refinement of annotation guidelines as an alignment mechanism, introducing an iterative moderation framework that simulates the early phases of annotation projects. We evaluate three hypotheses: (1) the efficacy of guideline integration, (2) the advantage of reasoning-optimized models, and (3) the viability of moderation under minimal supervision. Testing on biomedical NER tasks (NCBI Disease, BC5CDR, BioRED) with three LLM families (GPT, Gemini, DeepSeek), our results empirically confirm all three hypotheses. While the iterative moderation framework shows promise for refining guidelines, our analysis also reveals substantial room for improvement.
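To make the idea of iterative moderation concrete, the following is a minimal sketch and not the paper's implementation. It assumes a small labeled moderation set, an `annotate` callable that applies the current guidelines to a text, and a `revise` callable that proposes updated guidelines from the observed disagreements. All names here (`moderate`, `toy_annotate`, `toy_revise`, the rule strings, and the example sentences) are hypothetical, and the two toy functions are deterministic stand-ins for what would be LLM calls in practice.

```python
from typing import Callable, List, Set, Tuple

Example = Tuple[str, Set[str]]             # (text, gold entity mentions)
Disagreement = Tuple[str, Set[str], Set[str]]  # (text, gold, predicted)


def moderate(
    guidelines: List[str],
    examples: List[Example],
    annotate: Callable[[List[str], str], Set[str]],
    revise: Callable[[List[str], List[Disagreement]], List[str]],
    max_rounds: int = 3,
) -> Tuple[List[str], List[int]]:
    """Refine guidelines until annotations match gold or rounds run out."""
    history: List[int] = []  # number of disagreeing examples per round
    for _ in range(max_rounds):
        disagreements: List[Disagreement] = []
        for text, gold in examples:
            predicted = annotate(guidelines, text)
            if predicted != gold:
                disagreements.append((text, gold, predicted))
        history.append(len(disagreements))
        if not disagreements:
            break
        guidelines = revise(guidelines, disagreements)
    return guidelines, history


def toy_annotate(guidelines: List[str], text: str) -> Set[str]:
    """Stand-in annotator: tags 'cancer', optionally with its modifier."""
    tokens = text.split()
    mentions: Set[str] = set()
    for i, token in enumerate(tokens):
        if token == "cancer":
            if "include modifiers" in guidelines and i > 0:
                mentions.add(tokens[i - 1] + " " + token)
            else:
                mentions.add(token)
    return mentions


def toy_revise(guidelines: List[str], disagreements: List[Disagreement]) -> List[str]:
    """Stand-in moderator: adds one fixed rule when disagreements exist."""
    if "include modifiers" not in guidelines:
        return guidelines + ["include modifiers"]
    return guidelines


EXAMPLES: List[Example] = [
    ("patient has breast cancer", {"breast cancer"}),
    ("history of lung cancer", {"lung cancer"}),
]

final_guidelines, history = moderate(
    ["annotate disease mentions"], EXAMPLES, toy_annotate, toy_revise
)
print(final_guidelines)
print(history)
```

In this toy run the initial guidelines miss the benchmark's span convention, the stand-in moderator adds a rule after the first round, and the second round produces no disagreements. A real setup would replace both callables with model calls and would likely need a held-out set to check that the refined guidelines generalize.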