CL AI Sep 15, 2025

Is 'Hope' a person or an idea? A pilot benchmark for NER: comparing traditional NLP tools and large language models on ambiguous entities

arXiv:2509.12098v1

Originality: Synthesis-oriented

AI Analysis

This work addresses model selection for NER tasks by benchmarking performance on ambiguous entities, but it is incremental as it uses a small-scale dataset and focuses on comparing existing methods.

This pilot study compared six Named Entity Recognition systems on a small benchmark of ambiguous entities. Large language models generally outperformed traditional NLP tools on context-sensitive entities such as person names, with Gemini achieving the highest average F1-score, while traditional tools such as Stanza were more consistent on structured tags such as LOCATION and DATE.

This pilot study presents a small-scale but carefully annotated benchmark of Named Entity Recognition (NER) performance across six systems: three non-LLM NLP tools (NLTK, spaCy, Stanza) and three general-purpose large language models (LLMs: Gemini-1.5-flash, DeepSeek-V3, Qwen-3-4B). The dataset contains 119 tokens covering five entity types (PERSON, LOCATION, ORGANIZATION, DATE, TIME). We evaluated each system's output against the manually annotated gold standard dataset using F1-score. The results show that LLMs generally outperform conventional tools in recognizing context-sensitive entities like person names, with Gemini achieving the highest average F1-score. However, traditional systems like Stanza demonstrate greater consistency in structured tags such as LOCATION and DATE. We also observed variability among LLMs, particularly in handling temporal expressions and multi-word organizations. Our findings highlight that while LLMs offer improved contextual understanding, traditional tools remain competitive in specific tasks, informing model selection.
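As an illustration of the evaluation step described above, the following is a minimal sketch of per-type F1 scoring against a gold standard. It assumes token-level labels and a macro average over the entity types that actually occur; the abstract does not state whether the authors scored at token or span level, or how they averaged, so these choices are assumptions. The function names and the five-token example sentence are hypothetical and are not taken from the paper's dataset.

```python
from collections import Counter

# The five entity types named in the abstract; "O" marks a non-entity token.
ENTITY_TYPES = ("PERSON", "LOCATION", "ORGANIZATION", "DATE", "TIME")


def per_type_f1(gold, pred, labels=ENTITY_TYPES):
    """Token-level F1 per entity type (assumed scheme, not the paper's).

    Types with no gold and no predicted tokens are left out of the result
    so that they do not distort the average.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of tokens")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            if g in labels:
                tp[g] += 1
        else:
            if p in labels:
                fp[p] += 1  # system predicted a type the gold does not have
            if g in labels:
                fn[g] += 1  # system missed the gold type

    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        if denom:
            # F1 = 2TP / (2TP + FP + FN)
            scores[label] = 2 * tp[label] / denom
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-type F1 scores."""
    return sum(scores.values()) / len(scores) if scores else 0.0


# Hypothetical example: "Hope visited Paris on Monday"
gold = ["PERSON", "O", "LOCATION", "O", "DATE"]
# A system that reads "Hope" as a common noun rather than a name
pred = ["O", "O", "LOCATION", "O", "DATE"]

scores = per_type_f1(gold, pred)
average = macro_f1(scores)
print(scores, average)
```

In this toy case the missed "Hope" costs the system its entire PERSON score while LOCATION and DATE are unaffected, which mirrors the kind of ambiguity the benchmark targets. A span-level scheme would treat multi-word entities such as organization names as single units and could give different numbers.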
