A Dataset of German Legal Documents for Named Entity Recognition
The dataset is a valuable resource for researchers and practitioners working on NLP in the legal domain, specifically on German federal court decisions. The contribution is incremental, however, because it applies existing methods to new data.
The authors address the lack of a comprehensive dataset for Named Entity Recognition in German legal documents by creating a dataset of approximately 67,000 sentences and over 2 million tokens, with 54,000 manually annotated entities across 19 fine-grained classes and more than 35,000 automatically annotated time expressions.
We describe a dataset developed for Named Entity Recognition in German federal court decisions. It consists of approximately 67,000 sentences with over 2 million tokens. The resource contains 54,000 manually annotated entities, mapped to 19 fine-grained semantic classes: person, judge, lawyer, country, city, street, landscape, organization, company, institution, court, brand, law, ordinance, European legal norm, regulation, contract, court decision, and legal literature. The legal documents were, furthermore, automatically annotated with more than 35,000 TimeML-based time expressions. The dataset, which is available under a CC-BY 4.0 license in the CoNLL-2002 format, was developed for training an NER service for German legal documents in the EU project Lynx.
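To illustrate what working with data in the CoNLL-2002 format might look like, the following is a minimal sketch of a reader, not the authors' tooling. It assumes the usual layout of one token per line with the tag in the last whitespace-separated column, blank lines between sentences, and BIO-style tags. The sample sentences, the tag names (`PER`, `LAW`, `LOC`), and the function names `read_conll` and `extract_entities` are hypothetical and chosen only for illustration; the dataset's actual label inventory may be spelled differently.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, tag) pairs


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Split CoNLL-style lines into sentences of (token, tag) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        # Assumption: first column is the token, last column is the tag.
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence: Sentence) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity text, label) spans."""
    entities: List[Tuple[str, str]] = []
    tokens: List[str] = []
    label = None
    for token, tag in sentence:
        prefix, _, name = tag.partition("-")
        if prefix == "I" and name == label:
            # Continuation of the currently open entity.
            tokens.append(token)
            continue
        # Any other tag closes the open entity, if there is one.
        if tokens:
            entities.append((" ".join(tokens), label))
        if prefix in ("B", "I") and name:
            # A stray I- tag is treated as the start of a new entity.
            tokens, label = [token], name
        else:
            tokens, label = [], None
    if tokens:
        entities.append((" ".join(tokens), label))
    return entities


# Hypothetical sample, not taken from the dataset.
SAMPLE = """\
Der O
Richter O
Müller B-PER
verwies O
auf O
§ B-LAW
823 I-LAW
BGB I-LAW
. O

Das O
Gericht O
in O
Berlin B-LOC
entschied O
. O
"""

sentences = read_conll(SAMPLE.splitlines())
for sent in sentences:
    print(extract_entities(sent))
```

For the real files, the same functions could be applied to an open file handle instead of `SAMPLE.splitlines()`, provided the files follow the column layout assumed above.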