CL, May 20

Refining and Reusing Annotation Guidelines for LLM Annotation

arXiv:2605.20809
41.7
AI Analysis

For NLP practitioners using LLMs for annotation, this work offers a method to improve zero-shot annotation accuracy with minimal supervision.

The paper proposes an iterative moderation framework that refines and reuses annotation guidelines to align LLMs with gold-standard conventions, achieving improved performance on biomedical NER tasks across three LLM families.

While Large Language Models (LLMs) demonstrate remarkable performance on zero-shot annotation tasks, they often struggle with the specialized conventions of gold-standard benchmarks. We propose the systematic reuse and refinement of annotation guidelines as an alignment mechanism, introducing an iterative moderation framework that simulates the early phases of annotation projects. We evaluate three hypotheses: (1) the efficacy of guideline integration, (2) the advantage of reasoning-optimized models, and (3) the viability of moderation under minimal supervision. Testing across biomedical NER tasks (NCBI Disease, BC5CDR, BioRED) with three LLM families (GPT, Gemini, DeepSeek), our results empirically confirm all three hypotheses. While the iterative moderation framework shows promise for refining guidelines, our analysis also reveals substantial room for improvement.
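
The abstract does not spell out the moderation procedure, so the following Python sketch is only one plausible reading of an iterative guideline-refinement loop, not the authors' implementation. It assumes a small set of gold-annotated texts stands in for "minimal supervision", and that the annotator and moderator are plain callables. All names here (`refine_guidelines`, `span_f1`, `toy_annotate`, `toy_moderate`) are hypothetical, and the toy functions are rule-based stand-ins for LLM calls.

```python
from typing import Callable, List, Set, Tuple

Span = Tuple[int, int, str]                   # (start, end, label), end exclusive
Annotator = Callable[[str, str], Set[Span]]   # (guidelines, text) -> predicted spans
Moderator = Callable[[str, List[str]], str]   # (guidelines, notes) -> revised guidelines


def span_f1(pred: List[Set[Span]], gold: List[Set[Span]]) -> float:
    """Micro-averaged exact-match F1 over entity spans."""
    tp = sum(len(p & g) for p, g in zip(pred, gold))
    if tp == 0:
        return 0.0
    precision = tp / sum(len(p) for p in pred)
    recall = tp / sum(len(g) for g in gold)
    return 2 * precision * recall / (precision + recall)


def refine_guidelines(guidelines: str, texts: List[str], gold: List[Set[Span]],
                      annotate: Annotator, moderate: Moderator,
                      max_rounds: int = 3) -> Tuple[str, float]:
    """Annotate, compare with gold, revise the guidelines, and keep the best version."""
    best, best_f1 = guidelines, -1.0
    current = guidelines
    for _ in range(max_rounds):
        pred = [annotate(current, t) for t in texts]
        f1 = span_f1(pred, gold)
        if f1 <= best_f1:          # the revision did not help: stop and keep the best
            break
        best, best_f1 = current, f1
        # Turn each disagreement with the gold standard into a short note.
        notes: List[str] = []
        for t, p, g in zip(texts, pred, gold):
            notes += [f"missed {t[s:e]!r} ({lab})" for s, e, lab in sorted(g - p)]
            notes += [f"spurious {t[s:e]!r} ({lab})" for s, e, lab in sorted(p - g)]
        if not notes:              # full agreement: nothing left to moderate
            break
        current = moderate(current, notes)
    return best, best_f1


# Toy stand-ins (assumptions): a real setup would call an LLM in both places.
def toy_annotate(guidelines: str, text: str) -> Set[Span]:
    phrase = "hereditary breast cancer" if "modifiers" in guidelines else "breast cancer"
    start = text.find(phrase)
    return {(start, start + len(phrase), "Disease")} if start >= 0 else set()


def toy_moderate(guidelines: str, notes: List[str]) -> str:
    return guidelines + "\nInclude modifiers such as 'hereditary' in the disease span."


text = "Patients with hereditary breast cancer were enrolled."
start = text.find("hereditary breast cancer")
gold = [{(start, start + len("hereditary breast cancer"), "Disease")}]
refined, score = refine_guidelines("Mark every disease mention.", [text], gold,
                                   toy_annotate, toy_moderate)
```

In this toy run the initial guideline yields a span-boundary mismatch with the gold annotation, the moderator adds a boundary rule, and the revised guideline is kept because it scores higher on the supervised examples. Keeping the best-scoring version rather than the latest one is a design choice of this sketch, so a harmful revision cannot overwrite a better guideline.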

