CL | Nov 25, 2022

Finetuning BERT on Partially Annotated NER Corpora

arXiv:2211.14360v1 | 2 citations | h-index: 2 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the challenge of reducing annotation costs for NER. The advance is incremental because it builds on existing BERT finetuning methods.

The paper tackles the problem of training Named Entity Recognition models on partially annotated datasets by finetuning BERT with self-supervision and label preprocessing. On CoNLL 2003, RoBERTa finetuned this way with only 10% of entities labelled matches the baseline trained with 50% of entities labelled.

Most Named Entity Recognition (NER) models operate under the assumption that training datasets are fully labelled. While this assumption holds for established datasets like CoNLL 2003 and OntoNotes, it is sometimes not feasible to annotate a dataset completely. Such situations may occur, for instance, when entities are annotated selectively to reduce cost. This work presents an approach to finetuning BERT on such partially labelled datasets using self-supervision and label preprocessing. Our approach outperforms the previous LSTM-based label preprocessing baseline, significantly improving performance on poorly labelled datasets. We demonstrate that following our approach while finetuning RoBERTa on the CoNLL 2003 dataset with only 10% of total entities labelled is enough to reach the performance of the baseline trained on the same dataset with 50% of the entities labelled.
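To make the "10% of total entities labelled" setting concrete, the sketch below shows one way a partially annotated corpus could be simulated from fully labelled BIO tags. It is not taken from the paper or its repository: the function name `drop_entities`, the per-entity random sampling, and the choice to relabel dropped entities as "O" are all assumptions made for illustration.

```python
import random


def drop_entities(tags, keep_fraction, rng):
    """Return a copy of BIO `tags` in which each entity span is kept with
    probability `keep_fraction` and otherwise relabelled as "O".

    Hypothetical helper: treating unlabelled entities as "O" is an
    assumption, not necessarily how the paper encodes missing labels.
    """
    out = list(tags)
    i = 0
    while i < len(tags):
        if tags[i].startswith("B-"):
            # A span is the B- tag plus any I- tags that directly follow it.
            j = i + 1
            while j < len(tags) and tags[j].startswith("I-"):
                j += 1
            # rng.random() is in [0, 1), so keep_fraction=1.0 keeps every span.
            if rng.random() >= keep_fraction:
                for k in range(i, j):
                    out[k] = "O"
            i = j
        else:
            i += 1
    return out


# Example: keep roughly 10% of entity spans in one sentence.
sentence_tags = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG", "I-ORG"]
print(drop_entities(sentence_tags, 0.1, random.Random(0)))
```

Sampling whole spans rather than individual tokens keeps every remaining annotation well-formed, so no "I-" tag is left without its "B-" tag.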

Code Implementations: 1 repo