CL · Oct 16, 2023

Learning to Rank Context for Named Entity Recognition Using a Synthetic Dataset

arXiv:2310.10118v3136 citations · h-index: 3
Originality Incremental advance
AI Analysis

This paper addresses the context retrieval bottleneck for NER in long documents such as novels. The advance is incremental: it builds on existing methods and adds synthetic training data.

The paper tackles the limited context range of transformer-based named entity recognition (NER) on long documents. It generates a synthetic dataset with Alpaca and uses it to train a neural context retriever, which outperforms several retrieval baselines on an English literary dataset made up of the first chapters of 40 books.

While recent pre-trained transformer-based models can perform named entity recognition (NER) with great accuracy, their limited range remains an issue when applied to long documents such as whole novels. To alleviate this issue, one solution is to retrieve relevant context at the document level. Unfortunately, the lack of supervision for such a task means one has to settle for unsupervised approaches. Instead, we propose to generate a synthetic context retrieval training dataset using Alpaca, an instruction-tuned large language model (LLM). Using this dataset, we train a neural context retriever based on a BERT model that is able to find relevant context for NER. We show that our method outperforms several retrieval baselines for the NER task on an English literary dataset composed of the first chapter of 40 books.
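
The retrieval step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`overlap_score`, `retrieve_context`, `build_ner_input`) are hypothetical, the token-overlap scorer is only a stand-in for the trained BERT-based retriever, and joining the retrieved sentences with the target in document order is an assumption made for the example.

```python
from typing import Callable, List, Tuple


def overlap_score(query: str, candidate: str) -> float:
    """Stand-in relevance score: fraction of query tokens found in the candidate.

    Assumption: in the paper this role is played by a trained BERT-based
    retriever. Token overlap is used here only so the sketch runs without a model.
    """
    query_tokens = set(query.lower().split())
    candidate_tokens = set(candidate.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & candidate_tokens) / len(query_tokens)


def retrieve_context(
    sentences: List[str],
    target_index: int,
    k: int = 2,
    score: Callable[[str, str], float] = overlap_score,
) -> List[Tuple[int, str]]:
    """Return the k highest-scoring other sentences as (index, text), in document order."""
    target = sentences[target_index]
    candidates = [(i, s) for i, s in enumerate(sentences) if i != target_index]
    # Rank every other sentence of the document by relevance to the target.
    ranked = sorted(candidates, key=lambda pair: score(target, pair[1]), reverse=True)
    # Keep the top k, then restore document order.
    return sorted(ranked[:k])


def build_ner_input(
    sentences: List[str],
    target_index: int,
    k: int = 2,
    score: Callable[[str, str], float] = overlap_score,
) -> str:
    """Join the retrieved context and the target sentence (hypothetical input format)."""
    chosen = retrieve_context(sentences, target_index, k, score)
    chosen.append((target_index, sentences[target_index]))
    return " ".join(text for _, text in sorted(chosen))
```

Any function that takes the target sentence and a candidate sentence and returns a number can be passed as `score`, so a learned retriever could replace the overlap heuristic without changing the rest of the sketch.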

Code Implementations: 1 repo