CL · Dec 19, 2022

E-NER -- An Annotated Named Entity Recognition Corpus of Legal Text

arXiv:2212.09306v1 · 48 citations · h-index: 26
Originality: Synthesis-oriented
AI Analysis

The dataset addresses the performance degradation that researchers and practitioners encounter when applying general-purpose NER models to legal documents, though the contribution is incremental because it focuses on dataset creation.

The authors address named entity recognition (NER) in legal text by creating a new annotated dataset called E-NER. Models trained on the general English CoNLL-2003 corpus and tested on E-NER scored between 29.4% and 60.4% lower in F1 than the same models trained and tested on E-NER.

Identifying named entities, such as a person, location or organization, in documents can highlight key information to readers. Training Named Entity Recognition (NER) models requires an annotated data set, and creating one can be a time-consuming, labour-intensive task. Nevertheless, there are publicly available NER data sets for general English. Recently there has been interest in developing NER for legal text. However, prior work and experimental results reported here indicate that there is a significant degradation in performance when NER methods trained on a general English data set are applied to legal text. We describe a publicly available legal NER data set, called E-NER, based on legal company filings available from the US Securities and Exchange Commission's EDGAR data set. Training a number of different NER algorithms on the general English CoNLL-2003 corpus but testing on our test collection confirmed significant degradations in accuracy, as measured by the F1-score, of between 29.4% and 60.4%, compared to training and testing on the E-NER collection.
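The comparison in the abstract can be illustrated with a minimal sketch, assuming entity-level (exact span match) F1 over BIO tags, which is the usual convention for CoNLL-style evaluation. The abstract does not say whether the reported degradation is an absolute or a relative drop, so the sketch computes both. The tag sequences are toy data invented for illustration, not taken from E-NER, and the function names are hypothetical.

```python
def extract_spans(tags):
    """Return the set of (start, end, type) entity spans in a BIO tag sequence."""
    spans = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))  # close the open span (end is exclusive)
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            # Lenient handling (an assumption): a stray I- tag opens a new span.
            start, etype = i, tag[2:]
    return spans


def span_f1(gold, pred):
    """Entity-level F1: a predicted span counts only on an exact boundary and type match."""
    gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
    tp = len(gold_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): one gold sequence and two model outputs.
gold = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "O", "B-LOC"]
in_domain_pred = list(gold)  # stands in for a model trained on in-domain text
out_of_domain_pred = ["B-PER", "I-PER", "O", "B-ORG", "O", "O", "O"]

f1_in = span_f1(gold, in_domain_pred)
f1_out = span_f1(gold, out_of_domain_pred)

absolute_drop = f1_in - f1_out            # difference in F1 points
relative_drop = absolute_drop / f1_in     # fraction of in-domain F1 lost
print(f"in-domain F1={f1_in:.3f}, out-of-domain F1={f1_out:.3f}")
print(f"absolute drop={absolute_drop:.3f}, relative drop={relative_drop:.1%}")
```

Exact span matching penalises a truncated entity (here the ORG span cut one token short) as both a false positive and a false negative, which is one reason boundary errors on unfamiliar legal names can lower F1 sharply.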

Code Implementations: 1 repo