CL, AI · Sep 15, 2025

Is 'Hope' a person or an idea? A pilot benchmark for NER: comparing traditional NLP tools and large language models on ambiguous entities

arXiv:2509.12098v1
Originality: Synthesis-oriented
AI Analysis

This work addresses model selection for NER tasks by benchmarking performance on ambiguous entities. Its contribution is incremental: it uses a small-scale dataset and compares existing methods rather than proposing a new one.

This pilot study compared six Named Entity Recognition systems on a small benchmark of ambiguous entities. Large language models generally outperformed traditional NLP tools on context-sensitive entities, with Gemini achieving the highest average F1-score, while traditional tools like Stanza were more consistent on structured tags.

This pilot study presents a small-scale but carefully annotated benchmark of Named Entity Recognition (NER) performance across six systems: three non-LLM NLP tools (NLTK, spaCy, Stanza) and three general-purpose large language models (LLMs: Gemini-1.5-flash, DeepSeek-V3, Qwen-3-4B). The dataset contains 119 tokens covering five entity types (PERSON, LOCATION, ORGANIZATION, DATE, TIME). We evaluated each system's output against the manually annotated gold standard dataset using F1-score. The results show that LLMs generally outperform conventional tools in recognizing context-sensitive entities like person names, with Gemini achieving the highest average F1-score. However, traditional systems like Stanza demonstrate greater consistency in structured tags such as LOCATION and DATE. We also observed variability among LLMs, particularly in handling temporal expressions and multi-word organizations. Our findings highlight that while LLMs offer improved contextual understanding, traditional tools remain competitive in specific tasks, informing model selection.
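
The F1 comparison described in the abstract can be illustrated with a minimal sketch. It assumes token-level scoring, where each token carries one gold label and one predicted label, with "O" for non-entities. The abstract does not state the matching scheme or how scores were averaged, so the function names, the macro average, and the toy labels below are hypothetical and are not taken from the paper.

```python
from collections import Counter

# The five entity types named in the abstract; "O" marks a non-entity token.
LABELS = ["PERSON", "LOCATION", "ORGANIZATION", "DATE", "TIME"]


def per_label_f1(gold, pred, labels=LABELS):
    """Token-level F1 per entity type (an assumed evaluation scheme)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of tokens")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            if g in labels:
                tp[g] += 1          # correct entity label
        else:
            if p in labels:
                fp[p] += 1          # predicted an entity type that is wrong here
            if g in labels:
                fn[g] += 1          # missed the gold entity type
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the label never occurs.
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-label F1 scores."""
    return sum(scores.values()) / len(scores)


# Toy data for illustration only, not drawn from the paper's 119-token dataset.
gold = ["PERSON", "O", "LOCATION", "DATE", "O"]
pred = ["O", "O", "LOCATION", "DATE", "TIME"]
scores = per_label_f1(gold, pred)
print(scores, macro_f1(scores))
```

In this toy case the system misses the ambiguous PERSON token (for instance, "Hope" read as an idea rather than a name) and spuriously tags a TIME, so both labels score zero while LOCATION and DATE are perfect. A macro average treats all five types equally, which is one reasonable reading of "average F1-score" but remains an assumption here.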
