A Tidy Data Model for Natural Language Processing using cleanNLP
This work offers a domain-specific solution for NLP researchers and practitioners who need efficient data preprocessing; its contribution is incremental, since it builds on existing CoreNLP technology.
The authors address the problem of converting textual corpora into normalized tables by introducing the cleanNLP package, which uses Stanford's CoreNLP library to provide fast tools for a range of annotation tasks across multiple languages.
The package cleanNLP provides a set of fast tools for converting a textual corpus into a set of normalized tables. The underlying natural language processing pipeline utilizes Stanford's CoreNLP library, exposing a number of annotation tasks for text written in English, French, German, and Spanish. Annotators include tokenization, part-of-speech tagging, named entity recognition, entity linking, sentiment analysis, dependency parsing, coreference resolution, and information extraction.
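To make the idea of "normalized tables" concrete, the following is a minimal sketch of how a corpus might be flattened into a document table and a token table linked by shared keys. It is an illustration only: it does not call cleanNLP or CoreNLP, the function name `annotate` and the column names `doc_id`, `sid`, `tid`, `word`, and `n_chars` are hypothetical choices made for this example, and the regex-based sentence splitter and tokenizer are naive stand-ins for a real annotation pipeline.

```python
import re


def annotate(texts):
    """Return (documents, tokens) as two normalized tables (lists of row dicts).

    Hypothetical helper for illustration; not part of cleanNLP or CoreNLP.
    """
    documents, tokens = [], []
    for doc_id, text in enumerate(texts, start=1):
        # One row per document in the document table.
        documents.append({"doc_id": doc_id, "n_chars": len(text)})
        # Naive sentence split: break after ., ! or ? when followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        for sid, sentence in enumerate(sentences, start=1):
            # Naive tokenization: runs of word characters, or single punctuation marks.
            words = re.findall(r"\w+|[^\w\s]", sentence)
            for tid, word in enumerate(words, start=1):
                # One row per token, keyed by (doc_id, sid, tid).
                tokens.append({"doc_id": doc_id, "sid": sid, "tid": tid, "word": word})
    return documents, tokens


corpus = ["The cat sat. It purred.", "Dogs bark!"]
documents, tokens = annotate(corpus)

for row in tokens:
    print(row["doc_id"], row["sid"], row["tid"], row["word"])
```

Because every token row carries the keys `doc_id`, `sid`, and `tid`, further annotation layers (for instance named entities or dependencies) could in principle be stored as additional tables that join back to the token table on those keys, which is the general appeal of a normalized layout.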